Skip to content

ggml-metal: NULL-safety hardening for iOS Metal init failures#10

Merged
GustavoA1604 merged 3 commits into
tetherto:speechfrom
ishanvohra2:fix/metal-ios-null-propagation
May 20, 2026
Merged

ggml-metal: NULL-safety hardening for iOS Metal init failures#10
GustavoA1604 merged 3 commits into
tetherto:speechfrom
ishanvohra2:fix/metal-ios-null-propagation

Conversation

@ishanvohra2

Copy link
Copy Markdown

Summary

Harden ggml-metal against transient Metal init failures on iOS. On iPhone 16 (A18 / G17P) running iOS 26.4, repeated warm relaunches of ggml-speech-based TTS engines (chatterbox via tts-cpp) intermittently crash inside Apple's AGXMetalG17P.bundle during ggml_metal_device_init() / ggml_metal_library_init() with EXC_BAD_ACCESS / KERN_INVALID_ADDRESS at offset 0x0 or 0x10. Root cause in all signatures we've seen is the same: a Metal init step (MTLCreateSystemDefaultDevice, library load, buffer alloc) returns nil/NULL but ggml-metal keeps the half-constructed ggml_metal_device alive, caches it in the static device list, and dereferences its fields seconds later from a different thread.

Two small commits, both pure NULL-safety hardening — no behavior change on the happy path:

1. ggml-metal: NULL-guard ggml_metal_buffer_is_shared

ggml_metal_buffer_init() can return NULL when MTLDevice newBufferWithLength: fails (Metal heap exhaustion, transient OS eviction on iOS warm relaunches). The standard caller pattern in ggml-metal.cpp does:

ggml_metal_buffer_t res = ggml_metal_buffer_init(ctx_dev, size, shared);
ggml_backend_buffer_i buf_i = ggml_metal_buffer_is_shared(res) // <-- deref NULL
    ? ggml_backend_metal_buffer_shared_i
    : ggml_backend_metal_buffer_private_i;

Add an explicit NULL check so the standard "allocator returned NULL" error path doesn't turn into a SIGSEGV at 0x10 (offset of is_shared in the struct).

2. ggml-metal: propagate NULL through device init / backend reg / backend init

Make the whole Metal init chain refuse to return a half-initialized device:

  • ggml_metal_device_init() — early-return NULL (and free(dev)) when MTLCreateSystemDefaultDevice is nil, and tear down the command queue + device handle when ggml_metal_library_init returns NULL. Previously the library failure was logged but the partially-built struct was still returned, crashing downstream the first time anything touched dev->library->....
  • ggml_metal_device_get_obj / _get_queue / _get_library — make the accessors NULL-safe so a caller holding a NULL device pointer can't turn it into a deep stack SIGSEGV.
  • ggml_metal_device_get() — when ggml_metal_device_init() returns NULL, don't emplace_back it into the static device list; otherwise the next call would hand back a stale dangling entry.
  • ggml_backend_metal_init() — explicitly handle "no Metal devices registered" / "device context is NULL" and return NULL so callers can fall back to CPU instead of crashing.
  • ggml_backend_metal_device_init() — return NULL when device context creation failed, instead of constructing a ggml_backend_device with a NULL context.
  • ggml_backend_metal_reg() — skip devices whose init returned NULL when populating the registry.

Effect

After these changes, Metal init failures surface as a clean

ggml_backend_metal_init: error: Metal device init failed - falling back to CPU

log line, and the caller sees a NULL backend instead of taking down the process. The user-visible OOM path (s3gen weights buffer alloc failure) is unchanged — this only converts previously-silent NULL deref crashes into reportable errors that the consumer (e.g. tts-cpp / tts-ggml) can surface to the SDK.

Test plan

  • Diff inspected; no behavior change on the happy path (all new code is on NULL/nil branches that previously crashed).
  • Compiles for macOS (host) and iOS arm64 (tts-ggml prebuild step).
  • Re-run tts-ggml chatterbox tests on iPhone 16 (iOS 26.4) with useGPU: true — expect a clean Metal device init failed log + CPU fallback (or clean error to SDK) on the runs that previously crashed during bootstrap.
  • Re-run on a Mac host where Metal succeeds to confirm zero regressions on the happy path.

QVAC SDK Pod added 2 commits May 18, 2026 19:36
ggml_metal_buffer_init() can return NULL when the underlying MTLDevice
newBufferWithLength: call fails (Metal heap exhaustion, transient OS
eviction on iOS warm relaunches, etc.). Callers that immediately probe
the result via ggml_metal_buffer_is_shared(res) previously dereferenced
NULL one line later through buf->is_shared, crashing with
EXC_BAD_ACCESS / KERN_INVALID_ADDRESS at offset 0x10.

Add a defensive NULL check so the standard 'allocator returned NULL'
error path doesn't turn into a SIGSEGV at the next call site.
…d init

On iPhone 16 (A18 / G17P) running iOS 26.4, repeated warm relaunches of
ggml-speech-based TTS engines occasionally crash inside Apple's Metal
driver during ggml_metal_device_init() / ggml_metal_library_init() with
EXC_BAD_ACCESS at offset 0x0. The crash signature varies (most often
inside the AGXMetalG17P bundle), but the common shape is that one of
the Metal init steps (MTLCreateSystemDefaultDevice, library load,
residency-set setup) hands back nil/NULL and ggml-metal continues as if
it had succeeded - keeping a half-constructed ggml_metal_device, caching
it in the static device list, and dereferencing its fields seconds later
during the first allocation.

Make the whole Metal init chain refuse to return a half-initialized
device:

  * ggml_metal_device_init():
      - if MTLCreateSystemDefaultDevice returns nil, log and return NULL
        (instead of falling through with mtl_device == nil and later
        dereferencing dev->mtl_device).
      - if ggml_metal_library_init() returns NULL, release the already-
        created command queue and device, free the partially-built
        ggml_metal_device, and return NULL. Previously this was logged
        but the half-built struct was still returned, which crashed
        downstream the moment anything touched dev->library->...

  * ggml_metal_device_get_obj / _get_queue / _get_library: NULL-safe
    accessors so a caller that managed to obtain a NULL device pointer
    cannot turn it into a SIGSEGV three frames deep.

  * ggml_metal_device_get(): when ggml_metal_device_init() returns NULL,
    do NOT emplace the NULL into the static device list (otherwise the
    next call to ggml_metal_device_get(device) would return a stale
    dangling entry).

  * ggml_backend_metal_init(): explicitly handle the "no Metal devices
    registered" / "device context is NULL" cases and return NULL so the
    consumer code path can fall back to CPU instead of crashing.

  * ggml_backend_metal_device_init(): return NULL when the underlying
    device context could not be created, instead of constructing a
    ggml_backend_device with a NULL context.

  * ggml_backend_metal_reg(): skip devices whose init returned NULL when
    populating the registry, rather than caching a NULL device that
    every subsequent ggml_backend_reg_dev_get() would hand back.

After these changes, Metal init failures surface as a clean
"ggml_backend_metal_init: Metal device init failed" log line in
device.log, and the caller sees a NULL backend instead of taking down
the process. The user-visible OOM path (s3gen weights buffer alloc
failure) is unchanged - this only converts the previously-silent NULL
deref crashes into reportable errors.

@GustavoA1604 GustavoA1604 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor flagged this, not sure if relevant:

  1. ggml_metal_buffer_is_shared NULL guard is partial (commit 1)

Returning false for a NULL buffer prevents the SIGSEGV at 0x10, but the NULL res is still forwarded to the next line in the caller. The canonical pattern from the PR description is:

ggml_metal_buffer_t res = ggml_metal_buffer_init(ctx_dev, size, shared);
ggml_backend_buffer_i buf_i = ggml_metal_buffer_is_shared(res) // now safe
? ggml_backend_metal_buffer_shared_i
: ggml_backend_metal_buffer_private_i;
// res and buf_i are then used together — res is still NULL here
Please do a quick audit of every ggml_metal_buffer_init call site in ggml-metal.cpp and confirm each one already has an explicit if (res == NULL) { return NULL; } guard before reaching ggml_metal_buffer_is_shared. If any site does not, this fix just shifts the crash by one line.

  1. Incomplete teardown on library init failure (commit 2)

The new cleanup in ggml_metal_device_init() when ggml_metal_library_init returns NULL only releases mtl_queue and mtl_device. However, a non-trivial amount of setup happens between queue creation and the library init call (residency sets, properties, etc.), so other resources allocated in that range are leaked on this error path. The cleaner fix is to call ggml_metal_device_free(dev) instead of open-coding the two releases — assuming device_free is safe to call on a partially-initialized struct (i.e., it NULL-guards fields like library before touching them). If it isn't safe today, that's worth a small follow-up fix, but the long-term shape should use the canonical teardown rather than manually freeing individual fields

… & use canonical device teardown

Address @GustavoA1604's review on tetherto#10:

1) NULL-guard ggml_metal_buffer_init / ggml_metal_buffer_map at every
   call site, not just inside ggml_metal_buffer_is_shared.

   The previous defensive guard inside ggml_metal_buffer_is_shared
   stopped the SIGSEGV at offset 0x10, but the NULL `res` was still
   forwarded to ggml_backend_buffer_init on the very next line, which
   would just shift the crash. Audit of every ggml_metal_buffer_init /
   ggml_metal_buffer_map call site in ggml-metal.cpp:

     - ggml_backend_metal_buffer_type_alloc_buffer (shared/private)
     - ggml_backend_metal_device_buffer_mapped (mapped host pointer)

   Both now early-return NULL when the underlying allocator returns
   NULL, so the failure surfaces as a NULL ggml_backend_buffer_t to the
   caller instead of a backend buffer wrapping a NULL context. The
   in-function ggml_metal_buffer_is_shared NULL guard is kept as
   defense-in-depth.

2) Replace open-coded teardown in ggml_metal_device_init() with
   ggml_metal_device_free(dev).

   The previous error path released only mtl_queue and mtl_device on
   ggml_metal_library_init() failure. ggml_metal_device_free() is
   already NULL-safe across every owned field (rsets, library,
   mtl_queue, mtl_device - all calloc()ed to zero in
   ggml_metal_device_init), so calling it on a partially-initialized
   struct is safe today and stays correct as new owned resources are
   added to ggml_metal_device or moved earlier in the init sequence.
   Same canonical teardown is now used for the
   MTLCreateSystemDefaultDevice == nil case for consistency.

No behavior change on the happy path.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ishanvohra2

Copy link
Copy Markdown
Author

Thanks for the careful review @GustavoA1604 - both points were valid. Pushed as 6baf3baf:

1. ggml_metal_buffer_init call-site audit (commit 1 follow-up)

You're right - the in-function guard only stopped the immediate 0x10 deref, the NULL res would still flow into ggml_backend_buffer_init and shift the crash. Audited every call site in src/ggml-metal/ggml-metal.cpp:

  • ggml_backend_metal_buffer_type_alloc_buffer (shared / private path) - now if (res == NULL) return NULL; before constructing the backend buffer.
  • ggml_backend_metal_device_buffer_mapped (mapped host pointer path) - same guard added; ggml_metal_buffer_map has the same NULL-on-failure contract as ggml_metal_buffer_init.

Those are the only two sites that funnel ggml_metal_buffer_init / _map into ggml_backend_buffer_init. The remaining ggml_metal_buffer_is_shared references are all GGML_ASSERT(...) consistency checks on already-validated buffers, so they're not on a NULL-allocator code path.

I kept the in-function NULL guard inside ggml_metal_buffer_is_shared as defense-in-depth - it's a one-line check and means a future call site added without the early-return guard still won't deref 0x10.

2. Use ggml_metal_device_free(dev) for partial-init teardown (commit 2 follow-up)

Switched both error paths in ggml_metal_device_init() (the MTLCreateSystemDefaultDevice == nil case and the ggml_metal_library_init() == NULL case) to call ggml_metal_device_free(dev) instead of open-coding releases.

ggml_metal_device_free() is already safe on a partially-initialized struct: dev is calloc()ed in ggml_metal_device_init() (zero-init), and every cleanup path inside ggml_metal_device_free is NULL-guarded:

  • ggml_metal_rsets_free(dev->rsets) -> early-returns when rsets == NULL
  • ggml_metal_library_free(dev->library) -> early-returns when lib == NULL
  • mtl_queue / mtl_device releases are wrapped in if (...) guards

So we get the canonical teardown today and it stays correct if anything else gets added to ggml_metal_device or moved earlier in the init sequence.

@ishanvohra2 ishanvohra2 requested a review from GustavoA1604 May 20, 2026 12:47
GustavoA1604 added a commit to GustavoA1604/qvac-registry-vcpkg that referenced this pull request May 20, 2026
Picks up the 3-commit NULL-safety hardening from
tetherto/qvac-ext-ggml#10 (f21ff9a0 + b7dc01c7 + 0f54d9f7) cherry-
picked on top of `tetherto:speech` HEAD via the GustavoA1604 fork's
`speech-origin-plus-pr-10` branch.

Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON +
GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only
the Apple Metal init path changes -- transient
`MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer-
alloc failures now surface as a clean
`ggml_backend_metal_init: error: Metal device init failed - falling
back to CPU` log line + NULL return, instead of crashing inside
AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 /
iOS 26.4 warm relaunches).

Flip REPO back to `tetherto/qvac-ext-ggml` + bump REF/SHA512 once
PR-10 merges into tetherto:speech upstream.

Co-authored-by: Cursor <cursoragent@cursor.com>
@GustavoA1604 GustavoA1604 merged commit 08d39f0 into tetherto:speech May 20, 2026
GustavoA1604 added a commit to GustavoA1604/qvac-registry-vcpkg that referenced this pull request May 20, 2026
Picks up the iOS Metal NULL-safety hardening from
tetherto/qvac-ext-ggml#10 (PR-10's 3 commits f21ff9a0 + b7dc01c7 +
0f54d9f7) now merged into the `speech` branch as PR-10 merge
08d39f0c.

Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON +
GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only
the Apple Metal init path changes -- transient
`MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer-
alloc failures now surface as a clean
`ggml_backend_metal_init: error: Metal device init failed - falling
back to CPU` log line + NULL return, instead of crashing inside
AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 /
iOS 26.4 warm relaunches).

Co-authored-by: Cursor <cursoragent@cursor.com>
GustavoA1604 added a commit to tetherto/qvac-registry-vcpkg that referenced this pull request May 20, 2026
…roid dynamic backends) (#159)

* ggml-speech: bump to port-version 4 (PR #10 iOS Metal NULL-safety)

Picks up the iOS Metal NULL-safety hardening from
tetherto/qvac-ext-ggml#10 (PR-10's 3 commits f21ff9a0 + b7dc01c7 +
0f54d9f7) now merged into the `speech` branch as PR-10 merge
08d39f0c.

Android backend stack unchanged: GGML_CPU_ALL_VARIANTS=ON +
GGML_CPU_REPACK=ON shape introduced in port-version 3 stays. Only
the Apple Metal init path changes -- transient
`MTLCreateSystemDefaultDevice` / `ggml_metal_library_init` / buffer-
alloc failures now surface as a clean
`ggml_backend_metal_init: error: Metal device init failed - falling
back to CPU` log line + NULL return, instead of crashing inside
AGXMetalG17P.bundle with EXC_BAD_ACCESS (observed on iPhone 16 /
iOS 26.4 warm relaunches).

Co-authored-by: Cursor <cursoragent@cursor.com>

* tts-cpp: bump to 2026-05-20 + tighten ggml-speech to #4

Pulls in the Android dynamic-backend selection landed on
tetherto/qvac-ext-lib-whisper.cpp@master via PR #29 (merge 60dc1504,
content commit 907f3151): registry-only `init_gpu_backend()`
(Adreno 700+ -> OpenCL, every other GPU -> Vulkan / Metal / CUDA),
new EngineOptions::backends_dir / opencl_cache_dir fields wired
through to `tts_cpp::detail::set_backends_directory` /
`tts_cpp::detail::set_opencl_cache_dir`, and the GGML_BACKEND_DL=ON
Android CMake defaults so `ggml_backend_load_all_from_path()`
discovers per-arch CPU + Vulkan + OpenCL .so files installed
alongside the bare module.

Also tightens the ggml-speech dependency to `version>=: 2026-04-09#4`
so consumers without an explicit constraint of their own pull
ggml-speech port-version 4 (PR-10 iOS Metal NULL-safety) instead of
the registry baseline #1. tts-cpp's new `init_gpu_backend()` walks
the registry blindly and a NULL-from-Metal-init is now handled
cleanly thanks to PR-10, so #4 is the lowest version this port is
safe against.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants