ggml-metal: NULL-safety hardening for iOS Metal init failures#10
Conversation
ggml_metal_buffer_init() can return NULL when the underlying MTLDevice newBufferWithLength: call fails (Metal heap exhaustion, transient OS eviction on iOS warm relaunches, etc.). Callers that immediately probe the result via ggml_metal_buffer_is_shared(res) previously dereferenced NULL one line later through buf->is_shared, crashing with EXC_BAD_ACCESS / KERN_INVALID_ADDRESS at offset 0x10. Add a defensive NULL check so the standard 'allocator returned NULL' error path doesn't turn into a SIGSEGV at the next call site.
…d init
On iPhone 16 (A18 / G17P) running iOS 26.4, repeated warm relaunches of
ggml-speech-based TTS engines occasionally crash inside Apple's Metal
driver during ggml_metal_device_init() / ggml_metal_library_init() with
EXC_BAD_ACCESS at offset 0x0. The crash signature varies (most often
inside the AGXMetalG17P bundle), but the common shape is that one of
the Metal init steps (MTLCreateSystemDefaultDevice, library load,
residency-set setup) hands back nil/NULL and ggml-metal continues as if
it had succeeded - keeping a half-constructed ggml_metal_device, caching
it in the static device list, and dereferencing its fields seconds later
during the first allocation.
Make the whole Metal init chain refuse to return a half-initialized
device:
* ggml_metal_device_init():
- if MTLCreateSystemDefaultDevice returns nil, log and return NULL
(instead of falling through with mtl_device == nil and later
dereferencing dev->mtl_device).
- if ggml_metal_library_init() returns NULL, release the already-
created command queue and device, free the partially-built
ggml_metal_device, and return NULL. Previously this was logged
but the half-built struct was still returned, which crashed
downstream the moment anything touched dev->library->...
* ggml_metal_device_get_obj / _get_queue / _get_library: NULL-safe
accessors so a caller that managed to obtain a NULL device pointer
cannot turn it into a SIGSEGV three frames deep.
* ggml_metal_device_get(): when ggml_metal_device_init() returns NULL,
do NOT emplace the NULL into the static device list (otherwise the
next call to ggml_metal_device_get(device) would return a stale
dangling entry).
* ggml_backend_metal_init(): explicitly handle the "no Metal devices
registered" / "device context is NULL" cases and return NULL so the
consumer code path can fall back to CPU instead of crashing.
* ggml_backend_metal_device_init(): return NULL when the underlying
device context could not be created, instead of constructing a
ggml_backend_device with a NULL context.
* ggml_backend_metal_reg(): skip devices whose init returned NULL when
populating the registry, rather than caching a NULL device that
every subsequent ggml_backend_reg_dev_get() would hand back.
After these changes, Metal init failures surface as a clean
"ggml_backend_metal_init: Metal device init failed" log line in
device.log, and the caller sees a NULL backend instead of taking down
the process. The user-visible OOM path (s3gen weights buffer alloc
failure) is unchanged - this only converts the previously-silent NULL
deref crashes into reportable errors.
GustavoA1604
left a comment
There was a problem hiding this comment.
Cursor flagged this, not sure if relevant:
- ggml_metal_buffer_is_shared NULL guard is partial (commit 1)
Returning false for a NULL buffer prevents the SIGSEGV at 0x10, but the NULL res is still forwarded to the next line in the caller. The canonical pattern from the PR description is:
ggml_metal_buffer_t res = ggml_metal_buffer_init(ctx_dev, size, shared);
ggml_backend_buffer_i buf_i = ggml_metal_buffer_is_shared(res) // now safe
? ggml_backend_metal_buffer_shared_i
: ggml_backend_metal_buffer_private_i;
// res and buf_i are then used together — res is still NULL here
Please do a quick audit of every ggml_metal_buffer_init call site in ggml-metal.cpp and confirm each one already has an explicit if (res == NULL) { return NULL; } guard before reaching ggml_metal_buffer_is_shared. If any site does not, this fix just shifts the crash by one line.
- Incomplete teardown on library init failure (commit 2)
The new cleanup in ggml_metal_device_init() when ggml_metal_library_init returns NULL only releases mtl_queue and mtl_device. However, a non-trivial amount of setup happens between queue creation and the library init call (residency sets, properties, etc.), so other resources allocated in that range are leaked on this error path. The cleaner fix is to call ggml_metal_device_free(dev) instead of open-coding the two releases — assuming device_free is safe to call on a partially-initialized struct (i.e., it NULL-guards fields like library before touching them). If it isn't safe today, that's worth a small follow-up fix, but the long-term shape should use the canonical teardown rather than manually freeing individual fields
… & use canonical device teardown Address @GustavoA1604's review on tetherto#10: 1) NULL-guard ggml_metal_buffer_init / ggml_metal_buffer_map at every call site, not just inside ggml_metal_buffer_is_shared. The previous defensive guard inside ggml_metal_buffer_is_shared stopped the SIGSEGV at offset 0x10, but the NULL `res` was still forwarded to ggml_backend_buffer_init on the very next line, which would just shift the crash. Audit of every ggml_metal_buffer_init / ggml_metal_buffer_map call site in ggml-metal.cpp: - ggml_backend_metal_buffer_type_alloc_buffer (shared/private) - ggml_backend_metal_device_buffer_mapped (mapped host pointer) Both now early-return NULL when the underlying allocator returns NULL, so the failure surfaces as a NULL ggml_backend_buffer_t to the caller instead of a backend buffer wrapping a NULL context. The in-function ggml_metal_buffer_is_shared NULL guard is kept as defense-in-depth. 2) Replace open-coded teardown in ggml_metal_device_init() with ggml_metal_device_free(dev). The previous error path released only mtl_queue and mtl_device on ggml_metal_library_init() failure. ggml_metal_device_free() is already NULL-safe across every owned field (rsets, library, mtl_queue, mtl_device - all calloc()ed to zero in ggml_metal_device_init), so calling it on a partially-initialized struct is safe today and stays correct as new owned resources are added to ggml_metal_device or moved earlier in the init sequence. Same canonical teardown is now used for the MTLCreateSystemDefaultDevice == nil case for consistency. No behavior change on the happy path. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Thanks for the careful review @GustavoA1604 - both points were valid. Pushed as 1. You're right - the in-function guard only stopped the immediate
Those are the only two sites that funnel I kept the in-function NULL guard inside 2. Use Switched both error paths in
So we get the canonical teardown today and it stays correct if anything else gets added to |
Picks up the 3-commit NULL-safety hardening from tetherto/qvac-ext-ggml#10 (f21ff9a0 + b7dc01c7 + 0f54d9f7) cherry- picked on top of `tetherto:speech` HEAD via the GustavoA1604 fork's `speech-origin-plus-pr-10` branch. Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON + GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only the Apple Metal init path changes -- transient `MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer- alloc failures now surface as a clean `ggml_backend_metal_init: error: Metal device init failed - falling back to CPU` log line + NULL return, instead of crashing inside AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 / iOS 26.4 warm relaunches). Flip REPO back to `tetherto/qvac-ext-ggml` + bump REF/SHA512 once PR-10 merges into tetherto:speech upstream. Co-authored-by: Cursor <cursoragent@cursor.com>
Picks up the iOS Metal NULL-safety hardening from tetherto/qvac-ext-ggml#10 (PR-10's 3 commits f21ff9a0 + b7dc01c7 + 0f54d9f7) now merged into the `speech` branch as PR-10 merge 08d39f0c. Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON + GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only the Apple Metal init path changes -- transient `MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer- alloc failures now surface as a clean `ggml_backend_metal_init: error: Metal device init failed - falling back to CPU` log line + NULL return, instead of crashing inside AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 / iOS 26.4 warm relaunches). Co-authored-by: Cursor <cursoragent@cursor.com>
…roid dynamic backends) (#159) * ggml-speech: bump to port-version 4 (PR #10 iOS Metal NULL-safety) Picks up the iOS Metal NULL-safety hardening from tetherto/qvac-ext-ggml#10 (PR-10's 3 commits f21ff9a0 + b7dc01c7 + 0f54d9f7) now merged into the `speech` branch as PR-10 merge 08d39f0c. Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON + GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only the Apple Metal init path changes -- transient `MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer- alloc failures now surface as a clean `ggml_backend_metal_init: error: Metal device init failed - falling back to CPU` log line + NULL return, instead of crashing inside AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 / iOS 26.4 warm relaunches). Co-authored-by: Cursor <cursoragent@cursor.com> * tts-cpp: bump to 2026-05-20 + tighten ggml-speech to #4 Pulls in the Android dynamic-backend selection landed on tetherto/qvac-ext-lib-whisper.cpp@master via PR #29 (merge 60dc1504, content commit 907f3151): registry-only `init_gpu_backend()` (Adreno 700+ -> OpenCL, every other GPU -> Vulkan / Metal / CUDA), new EngineOptions::backends_dir / opencl_cache_dir fields wired through to `tts_cpp::detail::set_backends_directory` / `tts_cpp::detail::set_opencl_cache_dir`, and the GGML_BACKEND_DL=ON Android CMake defaults so `ggml_backend_load_all_from_path()` discovers per-arch CPU + Vulkan + OpenCL .so files installed alongside the bare module. Also tightens the ggml-speech dependency to `version>=: 2026-04-09#4` so consumers without an explicit constraint of their own pull ggml-speech port-version 4 (PR-10 iOS Metal NULL-safety) instead of the registry baseline #1. tts-cpp's new `init_gpu_backend()` walks the registry blindly and a NULL-from-Metal-init is now handled cleanly thanks to PR-10, so #4 is the lowest version this port is safe against. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Harden
ggml-metalagainst transient Metal init failures on iOS. On iPhone 16 (A18 / G17P) running iOS 26.4, repeated warm relaunches ofggml-speech-based TTS engines (chatterbox viatts-cpp) intermittently crash inside Apple'sAGXMetalG17P.bundleduringggml_metal_device_init()/ggml_metal_library_init()withEXC_BAD_ACCESS/KERN_INVALID_ADDRESSat offset0x0or0x10. Root cause in all signatures we've seen is the same: a Metal init step (MTLCreateSystemDefaultDevice, library load, buffer alloc) returnsnil/NULLbutggml-metalkeeps the half-constructedggml_metal_devicealive, caches it in the static device list, and dereferences its fields seconds later from a different thread.Two small commits, both pure NULL-safety hardening — no behavior change on the happy path:
1.
ggml-metal: NULL-guard ggml_metal_buffer_is_sharedggml_metal_buffer_init()can returnNULLwhenMTLDevice newBufferWithLength:fails (Metal heap exhaustion, transient OS eviction on iOS warm relaunches). The standard caller pattern inggml-metal.cppdoes:Add an explicit
NULLcheck so the standard "allocator returned NULL" error path doesn't turn into aSIGSEGV at 0x10(offset ofis_sharedin the struct).2.
ggml-metal: propagate NULL through device init / backend reg / backend initMake the whole Metal init chain refuse to return a half-initialized device:
ggml_metal_device_init()— early-returnNULL(andfree(dev)) whenMTLCreateSystemDefaultDeviceisnil, and tear down the command queue + device handle whenggml_metal_library_initreturnsNULL. Previously the library failure was logged but the partially-built struct was still returned, crashing downstream the first time anything toucheddev->library->....ggml_metal_device_get_obj/_get_queue/_get_library— make the accessors NULL-safe so a caller holding a NULL device pointer can't turn it into a deep stackSIGSEGV.ggml_metal_device_get()— whenggml_metal_device_init()returnsNULL, don'templace_backit into the static device list; otherwise the next call would hand back a stale dangling entry.ggml_backend_metal_init()— explicitly handle "no Metal devices registered" / "device context is NULL" and returnNULLso callers can fall back to CPU instead of crashing.ggml_backend_metal_device_init()— returnNULLwhen device context creation failed, instead of constructing aggml_backend_devicewith aNULLcontext.ggml_backend_metal_reg()— skip devices whose init returnedNULLwhen populating the registry.Effect
After these changes, Metal init failures surface as a clean
log line, and the caller sees a
NULLbackend instead of taking down the process. The user-visible OOM path (s3gen weights buffer alloc failure) is unchanged — this only converts previously-silentNULLderef crashes into reportable errors that the consumer (e.g.tts-cpp/tts-ggml) can surface to the SDK.Test plan
NULL/nilbranches that previously crashed).tts-ggmlprebuild step).tts-ggmlchatterbox tests on iPhone 16 (iOS 26.4) withuseGPU: true— expect a cleanMetal device init failedlog + CPU fallback (or clean error to SDK) on the runs that previously crashed during bootstrap.