ggml-metal: NULL-safety hardening for iOS Metal init failures by ishanvohra2 · Pull Request #10 · tetherto/qvac-ext-ggml

ishanvohra2 · 2026-05-18T14:12:13Z

Summary

Harden ggml-metal against transient Metal init failures on iOS. On iPhone 16 (A18 / G17P) running iOS 26.4, repeated warm relaunches of ggml-speech-based TTS engines (chatterbox via tts-cpp) intermittently crash inside Apple's AGXMetalG17P.bundle during ggml_metal_device_init() / ggml_metal_library_init() with EXC_BAD_ACCESS / KERN_INVALID_ADDRESS at offset 0x0 or 0x10. Root cause in all signatures we've seen is the same: a Metal init step (MTLCreateSystemDefaultDevice, library load, buffer alloc) returns nil/NULL but ggml-metal keeps the half-constructed ggml_metal_device alive, caches it in the static device list, and dereferences its fields seconds later from a different thread.

Two small commits, both pure NULL-safety hardening — no behavior change on the happy path:

1. `ggml-metal: NULL-guard ggml_metal_buffer_is_shared`

ggml_metal_buffer_init() can return NULL when MTLDevice newBufferWithLength: fails (Metal heap exhaustion, transient OS eviction on iOS warm relaunches). The standard caller pattern in ggml-metal.cpp does:

ggml_metal_buffer_t res = ggml_metal_buffer_init(ctx_dev, size, shared);
ggml_backend_buffer_i buf_i = ggml_metal_buffer_is_shared(res) // <-- deref NULL
    ? ggml_backend_metal_buffer_shared_i
    : ggml_backend_metal_buffer_private_i;

Add an explicit NULL check so the standard "allocator returned NULL" error path doesn't turn into a SIGSEGV at 0x10 (offset of is_shared in the struct).

2. `ggml-metal: propagate NULL through device init / backend reg / backend init`

Make the whole Metal init chain refuse to return a half-initialized device:

ggml_metal_device_init() — early-return NULL (and free(dev)) when MTLCreateSystemDefaultDevice is nil, and tear down the command queue + device handle when ggml_metal_library_init returns NULL. Previously the library failure was logged but the partially-built struct was still returned, crashing downstream the first time anything touched dev->library->....
ggml_metal_device_get_obj / _get_queue / _get_library — make the accessors NULL-safe so a caller holding a NULL device pointer can't turn it into a deep stack SIGSEGV.
ggml_metal_device_get() — when ggml_metal_device_init() returns NULL, don't emplace_back it into the static device list; otherwise the next call would hand back a stale dangling entry.
ggml_backend_metal_init() — explicitly handle "no Metal devices registered" / "device context is NULL" and return NULL so callers can fall back to CPU instead of crashing.
ggml_backend_metal_device_init() — return NULL when device context creation failed, instead of constructing a ggml_backend_device with a NULL context.
ggml_backend_metal_reg() — skip devices whose init returned NULL when populating the registry.

Effect

After these changes, Metal init failures surface as a clean

ggml_backend_metal_init: error: Metal device init failed - falling back to CPU

log line, and the caller sees a NULL backend instead of taking down the process. The user-visible OOM path (s3gen weights buffer alloc failure) is unchanged — this only converts previously-silent NULL deref crashes into reportable errors that the consumer (e.g. tts-cpp / tts-ggml) can surface to the SDK.

Test plan

Diff inspected; no behavior change on the happy path (all new code is on NULL/nil branches that previously crashed).
Compiles for macOS (host) and iOS arm64 (tts-ggml prebuild step).
Re-run tts-ggml chatterbox tests on iPhone 16 (iOS 26.4) with useGPU: true — expect a clean Metal device init failed log + CPU fallback (or clean error to SDK) on the runs that previously crashed during bootstrap.
Re-run on a Mac host where Metal succeeds to confirm zero regressions on the happy path.

ggml_metal_buffer_init() can return NULL when the underlying MTLDevice newBufferWithLength: call fails (Metal heap exhaustion, transient OS eviction on iOS warm relaunches, etc.). Callers that immediately probe the result via ggml_metal_buffer_is_shared(res) previously dereferenced NULL one line later through buf->is_shared, crashing with EXC_BAD_ACCESS / KERN_INVALID_ADDRESS at offset 0x10. Add a defensive NULL check so the standard 'allocator returned NULL' error path doesn't turn into a SIGSEGV at the next call site.

…d init On iPhone 16 (A18 / G17P) running iOS 26.4, repeated warm relaunches of ggml-speech-based TTS engines occasionally crash inside Apple's Metal driver during ggml_metal_device_init() / ggml_metal_library_init() with EXC_BAD_ACCESS at offset 0x0. The crash signature varies (most often inside the AGXMetalG17P bundle), but the common shape is that one of the Metal init steps (MTLCreateSystemDefaultDevice, library load, residency-set setup) hands back nil/NULL and ggml-metal continues as if it had succeeded - keeping a half-constructed ggml_metal_device, caching it in the static device list, and dereferencing its fields seconds later during the first allocation. Make the whole Metal init chain refuse to return a half-initialized device: * ggml_metal_device_init(): - if MTLCreateSystemDefaultDevice returns nil, log and return NULL (instead of falling through with mtl_device == nil and later dereferencing dev->mtl_device). - if ggml_metal_library_init() returns NULL, release the already- created command queue and device, free the partially-built ggml_metal_device, and return NULL. Previously this was logged but the half-built struct was still returned, which crashed downstream the moment anything touched dev->library->... * ggml_metal_device_get_obj / _get_queue / _get_library: NULL-safe accessors so a caller that managed to obtain a NULL device pointer cannot turn it into a SIGSEGV three frames deep. * ggml_metal_device_get(): when ggml_metal_device_init() returns NULL, do NOT emplace the NULL into the static device list (otherwise the next call to ggml_metal_device_get(device) would return a stale dangling entry). * ggml_backend_metal_init(): explicitly handle the "no Metal devices registered" / "device context is NULL" cases and return NULL so the consumer code path can fall back to CPU instead of crashing. * ggml_backend_metal_device_init(): return NULL when the underlying device context could not be created, instead of constructing a ggml_backend_device with a NULL context. * ggml_backend_metal_reg(): skip devices whose init returned NULL when populating the registry, rather than caching a NULL device that every subsequent ggml_backend_reg_dev_get() would hand back. After these changes, Metal init failures surface as a clean "ggml_backend_metal_init: Metal device init failed" log line in device.log, and the caller sees a NULL backend instead of taking down the process. The user-visible OOM path (s3gen weights buffer alloc failure) is unchanged - this only converts the previously-silent NULL deref crashes into reportable errors.

GustavoA1604

Cursor flagged this, not sure if relevant:

ggml_metal_buffer_is_shared NULL guard is partial (commit 1)

Returning false for a NULL buffer prevents the SIGSEGV at 0x10, but the NULL res is still forwarded to the next line in the caller. The canonical pattern from the PR description is:

ggml_metal_buffer_t res = ggml_metal_buffer_init(ctx_dev, size, shared);
ggml_backend_buffer_i buf_i = ggml_metal_buffer_is_shared(res) // now safe
? ggml_backend_metal_buffer_shared_i
: ggml_backend_metal_buffer_private_i;
// res and buf_i are then used together — res is still NULL here
Please do a quick audit of every ggml_metal_buffer_init call site in ggml-metal.cpp and confirm each one already has an explicit if (res == NULL) { return NULL; } guard before reaching ggml_metal_buffer_is_shared. If any site does not, this fix just shifts the crash by one line.

Incomplete teardown on library init failure (commit 2)

The new cleanup in ggml_metal_device_init() when ggml_metal_library_init returns NULL only releases mtl_queue and mtl_device. However, a non-trivial amount of setup happens between queue creation and the library init call (residency sets, properties, etc.), so other resources allocated in that range are leaked on this error path. The cleaner fix is to call ggml_metal_device_free(dev) instead of open-coding the two releases — assuming device_free is safe to call on a partially-initialized struct (i.e., it NULL-guards fields like library before touching them). If it isn't safe today, that's worth a small follow-up fix, but the long-term shape should use the canonical teardown rather than manually freeing individual fields

@GustavoA1604

… & use canonical device teardown Address @GustavoA1604's review on tetherto#10: 1) NULL-guard ggml_metal_buffer_init / ggml_metal_buffer_map at every call site, not just inside ggml_metal_buffer_is_shared. The previous defensive guard inside ggml_metal_buffer_is_shared stopped the SIGSEGV at offset 0x10, but the NULL `res` was still forwarded to ggml_backend_buffer_init on the very next line, which would just shift the crash. Audit of every ggml_metal_buffer_init / ggml_metal_buffer_map call site in ggml-metal.cpp: - ggml_backend_metal_buffer_type_alloc_buffer (shared/private) - ggml_backend_metal_device_buffer_mapped (mapped host pointer) Both now early-return NULL when the underlying allocator returns NULL, so the failure surfaces as a NULL ggml_backend_buffer_t to the caller instead of a backend buffer wrapping a NULL context. The in-function ggml_metal_buffer_is_shared NULL guard is kept as defense-in-depth. 2) Replace open-coded teardown in ggml_metal_device_init() with ggml_metal_device_free(dev). The previous error path released only mtl_queue and mtl_device on ggml_metal_library_init() failure. ggml_metal_device_free() is already NULL-safe across every owned field (rsets, library, mtl_queue, mtl_device - all calloc()ed to zero in ggml_metal_device_init), so calling it on a partially-initialized struct is safe today and stays correct as new owned resources are added to ggml_metal_device or moved earlier in the init sequence. Same canonical teardown is now used for the MTLCreateSystemDefaultDevice == nil case for consistency. No behavior change on the happy path. Co-authored-by: Cursor <cursoragent@cursor.com>

ishanvohra2 · 2026-05-20T12:24:48Z

Thanks for the careful review @GustavoA1604 - both points were valid. Pushed as 6baf3baf:

1. ggml_metal_buffer_init call-site audit (commit 1 follow-up)

You're right - the in-function guard only stopped the immediate 0x10 deref, the NULL res would still flow into ggml_backend_buffer_init and shift the crash. Audited every call site in src/ggml-metal/ggml-metal.cpp:

ggml_backend_metal_buffer_type_alloc_buffer (shared / private path) - now if (res == NULL) return NULL; before constructing the backend buffer.
ggml_backend_metal_device_buffer_mapped (mapped host pointer path) - same guard added; ggml_metal_buffer_map has the same NULL-on-failure contract as ggml_metal_buffer_init.

Those are the only two sites that funnel ggml_metal_buffer_init / _map into ggml_backend_buffer_init. The remaining ggml_metal_buffer_is_shared references are all GGML_ASSERT(...) consistency checks on already-validated buffers, so they're not on a NULL-allocator code path.

I kept the in-function NULL guard inside ggml_metal_buffer_is_shared as defense-in-depth - it's a one-line check and means a future call site added without the early-return guard still won't deref 0x10.

2. Use ggml_metal_device_free(dev) for partial-init teardown (commit 2 follow-up)

Switched both error paths in ggml_metal_device_init() (the MTLCreateSystemDefaultDevice == nil case and the ggml_metal_library_init() == NULL case) to call ggml_metal_device_free(dev) instead of open-coding releases.

ggml_metal_device_free() is already safe on a partially-initialized struct: dev is calloc()ed in ggml_metal_device_init() (zero-init), and every cleanup path inside ggml_metal_device_free is NULL-guarded:

ggml_metal_rsets_free(dev->rsets) -> early-returns when rsets == NULL
ggml_metal_library_free(dev->library) -> early-returns when lib == NULL
mtl_queue / mtl_device releases are wrapped in if (...) guards

So we get the canonical teardown today and it stays correct if anything else gets added to ggml_metal_device or moved earlier in the init sequence.

Picks up the 3-commit NULL-safety hardening from tetherto/qvac-ext-ggml#10 (f21ff9a0 + b7dc01c7 + 0f54d9f7) cherry- picked on top of `tetherto:speech` HEAD via the GustavoA1604 fork's `speech-origin-plus-pr-10` branch. Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON + GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only the Apple Metal init path changes -- transient `MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer- alloc failures now surface as a clean `ggml_backend_metal_init: error: Metal device init failed - falling back to CPU` log line + NULL return, instead of crashing inside AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 / iOS 26.4 warm relaunches). Flip REPO back to `tetherto/qvac-ext-ggml` + bump REF/SHA512 once PR-10 merges into tetherto:speech upstream. Co-authored-by: Cursor <cursoragent@cursor.com>

Picks up the iOS Metal NULL-safety hardening from tetherto/qvac-ext-ggml#10 (PR-10's 3 commits f21ff9a0 + b7dc01c7 + 0f54d9f7) now merged into the `speech` branch as PR-10 merge 08d39f0c. Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON + GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only the Apple Metal init path changes -- transient `MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer- alloc failures now surface as a clean `ggml_backend_metal_init: error: Metal device init failed - falling back to CPU` log line + NULL return, instead of crashing inside AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 / iOS 26.4 warm relaunches). Co-authored-by: Cursor <cursoragent@cursor.com>

…roid dynamic backends) (#159) * ggml-speech: bump to port-version 4 (PR #10 iOS Metal NULL-safety) Picks up the iOS Metal NULL-safety hardening from tetherto/qvac-ext-ggml#10 (PR-10's 3 commits f21ff9a0 + b7dc01c7 + 0f54d9f7) now merged into the `speech` branch as PR-10 merge 08d39f0c. Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON + GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only the Apple Metal init path changes -- transient `MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer- alloc failures now surface as a clean `ggml_backend_metal_init: error: Metal device init failed - falling back to CPU` log line + NULL return, instead of crashing inside AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 / iOS 26.4 warm relaunches). Co-authored-by: Cursor <cursoragent@cursor.com> * tts-cpp: bump to 2026-05-20 + tighten ggml-speech to #4 Pulls in the Android dynamic-backend selection landed on tetherto/qvac-ext-lib-whisper.cpp@master via PR #29 (merge 60dc1504, content commit 907f3151): registry-only `init_gpu_backend()` (Adreno 700+ -> OpenCL, every other GPU -> Vulkan / Metal / CUDA), new EngineOptions::backends_dir / opencl_cache_dir fields wired through to `tts_cpp::detail::set_backends_directory` / `tts_cpp::detail::set_opencl_cache_dir`, and the GGML_BACKEND_DL=ON Android CMake defaults so `ggml_backend_load_all_from_path()` discovers per-arch CPU + Vulkan + OpenCL .so files installed alongside the bare module. Also tightens the ggml-speech dependency to `version>=: 2026-04-09#4` so consumers without an explicit constraint of their own pull ggml-speech port-version 4 (PR-10 iOS Metal NULL-safety) instead of the registry baseline #1. tts-cpp's new `init_gpu_backend()` walks the registry blindly and a NULL-from-Metal-init is now handled cleanly thanks to PR-10, so #4 is the lowest version this port is safe against. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

QVAC SDK Pod added 2 commits May 18, 2026 19:36

GustavoA1604 requested changes May 18, 2026

View reviewed changes

ishanvohra2 requested a review from GustavoA1604 May 20, 2026 12:47

GustavoA1604 merged commit 08d39f0 into tetherto:speech May 20, 2026

GustavoA1604 mentioned this pull request May 20, 2026

feat: ggml-speech#4 (iOS Metal NULL-safety) + tts-cpp 2026-05-20 (Android dynamic backends) tetherto/qvac-registry-vcpkg#159

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml-metal: NULL-safety hardening for iOS Metal init failures#10

ggml-metal: NULL-safety hardening for iOS Metal init failures#10
GustavoA1604 merged 3 commits into
tetherto:speechfrom
ishanvohra2:fix/metal-ios-null-propagation

ishanvohra2 commented May 18, 2026

Uh oh!

GustavoA1604 left a comment

Uh oh!

ishanvohra2 commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ishanvohra2 commented May 18, 2026

Summary

1. ggml-metal: NULL-guard ggml_metal_buffer_is_shared

2. ggml-metal: propagate NULL through device init / backend reg / backend init

Effect

Test plan

Uh oh!

GustavoA1604 left a comment

Choose a reason for hiding this comment

Uh oh!

ishanvohra2 commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `ggml-metal: NULL-guard ggml_metal_buffer_is_shared`

2. `ggml-metal: propagate NULL through device init / backend reg / backend init`