
Core: Fix RID_Alloc Lock-Free Reads on Weakly-Ordered Architectures #114937

Closed
stuartcarnie wants to merge 2 commits into godotengine:master from stuartcarnie:apple_arm_lock_free

Conversation


@stuartcarnie stuartcarnie commented Jan 14, 2026

Fixes #114900

This PR fixes a memory ordering bug in RID_Alloc that caused data races on weakly-ordered architectures (ARM, Apple Silicon, POWER, RISC-V). The fix adds proper memory_order_release/memory_order_acquire semantics to both max_alloc and validator atomic operations.

Important

🤖 AI disclosure

I used AI to help identify and fix the issue. I also used AI to help describe the problem in detail, which I reviewed and edited. I have reviewed the code, understand it and tested a release build with the original MRP, Bistro Demo, and can confirm it works with commit 91c9e27.

The Bug

The RID_Alloc class uses a lock-free read path in get_or_null() and owns() for performance. These functions read shared state without holding the mutex, but the original code used memory_order_relaxed which provides no ordering guarantees.

Original Code (Broken)

// In _allocate_rid() - writer side
chunks[chunk_count][i].validator = 0xFFFFFFFF;     // 1. Initialize chunk (rare, when growing)
((std::atomic<uint32_t> *)&max_alloc)->store(...,
    std::memory_order_relaxed);                     // 2. Publish chunk (NO ordering)
chunks[idx][elem].validator = validator | 0x80000000; // 3. Set validator (EVERY allocation)
mutex.unlock();

// In get_or_null() / owns() - reader side
SYNC_ACQUIRE;  // ← Fence BEFORE load (wrong position!)
ma = ((std::atomic<uint32_t> *)&max_alloc)->load(
    std::memory_order_relaxed);                     // ← NO ordering guarantee
// ... access validator (non-atomic read!)

The Race Conditions

Race 1: Chunk initialization (when growing)

  1. Thread A initializes chunk data, stores max_alloc with relaxed ordering
  2. Thread B loads max_alloc with relaxed ordering, sees new value
  3. Thread B accesses chunk but sees stale/uninitialized data

Race 2: Validator (every allocation)

  1. Thread A writes validator (non-atomic), unlocks mutex
  2. Thread B loads max_alloc, accesses validator
  3. Thread B sees stale validator because there's no synchronization

The second race is the critical one—max_alloc only changes when chunks grow (rare), but validator changes on every allocation.

Why WEAK_MEMORY_ORDER 1 Didn't Fix It

The codebase had a WEAK_MEMORY_ORDER macro that used atomic_thread_fence:

#define SYNC_ACQUIRE std::atomic_thread_fence(std::memory_order_acquire);

This failed for two reasons:

Problem 1: Fence Position

SYNC_ACQUIRE;  // Fence here...
ma = load(max_alloc, relaxed);  // ...then load (WRONG!)

Acquire fences must come after the load, not before.

Problem 2: No Synchronization on Validator

Even with correct max_alloc synchronization, the validator field was accessed non-atomically. The max_alloc release/acquire only synchronizes chunk structure, not per-slot validator writes.

The Fix

Change 1: Release Store for Validator in _allocate_rid()

// Before: two non-atomic writes
chunks[free_chunk][free_element].validator = validator;
chunks[free_chunk][free_element].validator |= 0x80000000;

// After: single atomic release store
((std::atomic<uint32_t> *)&chunks[free_chunk][free_element].validator)
    ->store(validator | 0x80000000, std::memory_order_release);

What memory_order_release guarantees:

A release store acts as a one-way barrier that prevents preceding operations from being reordered after it:

writes to chunks[n]           ─┐
writes to free_list           │  Cannot move below the release store
                              ─┘
         ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
   store(validator, release)  [RELEASE BARRIER]
         ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲

This ensures that when another thread observes the validator, all initialization that preceded the store is guaranteed visible.

Change 2: Acquire Load for Validator in get_or_null() and owns()

// Before: non-atomic read with misplaced fence
SYNC_ACQUIRE;
// ... later ...
if (c.validator != validator) { ... }

// After: atomic acquire load
uint32_t cur_validator;
if constexpr (THREAD_SAFE) {
    cur_validator = ((std::atomic<uint32_t> *)&chunks[idx_chunk][idx_element].validator)
        ->load(std::memory_order_acquire);
} else {
    cur_validator = chunks[idx_chunk][idx_element].validator;
}

What memory_order_acquire guarantees:

An acquire load acts as a one-way barrier that prevents subsequent operations from being reordered before it:

         ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
   load(validator, acquire)   [ACQUIRE BARRIER]
         ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲
                              ─┐
read chunk data                │  Cannot move above the acquire load
                              ─┘

Change 3: Release Store in get_or_null() When Clearing Init Bit

if constexpr (THREAD_SAFE) {
    ((std::atomic<uint32_t> *)&chunks[idx_chunk][idx_element].validator)
        ->store(cur_validator & 0x7FFFFFFF, std::memory_order_release);
} else {
    chunks[idx_chunk][idx_element].validator = cur_validator & 0x7FFFFFFF;
}

Change 4: Retain max_alloc Release/Acquire

The max_alloc synchronization is still needed for chunk structure visibility:

// _allocate_rid() - when growing chunks
((std::atomic<uint32_t> *)&max_alloc)->store(max_alloc + elements_in_chunk, std::memory_order_release);

// get_or_null() / owns()
ma = ((std::atomic<uint32_t> *)&max_alloc)->load(std::memory_order_acquire);

Complete Synchronization Model

Thread A (_allocate_rid):              Thread B (get_or_null/owns):
────────────────────────              ────────────────────────────
[if growing chunks:]
  init chunk data
  store(max_alloc, release)  ────────► load(max_alloc, acquire)
                                              │
store(validator, release)    ────────────────►│
                                              ▼
                                       load(validator, acquire)
                                              │
                                              ▼
                                       access data ✓

Both synchronization points are necessary:

  • max_alloc: ensures chunk structure is visible (when chunks grow)
  • validator: ensures per-slot allocation is visible (every allocation)

How This Works on Different Architectures

Strongly-Ordered (x86/x64)

x86 has Total Store Order (TSO): stores aren't reordered with other stores, loads aren't reordered with other loads, and only a store followed by a later load may be reordered. The original relaxed operations happened to work at the hardware level, but the compiler was still free to reorder them. With release/acquire, the compiler is constrained, while the generated x86 instructions are unchanged.

Weakly-Ordered (ARM/Apple Silicon)

ARM permits almost any reordering unless explicitly prevented:

  • Release store compiles to STLR instruction—all prior writes complete first
  • Acquire load compiles to LDAR instruction—all subsequent reads wait

The C++ memory model abstracts this, making the code portable across all architectures.

Why Lock-Free Instead of Mutex?

The lock-free read path provides:

  1. No contention: Multiple readers proceed simultaneously
  2. No syscalls: Mutex operations may involve kernel transitions
  3. Better scalability: Critical for frequently-called owns() validation

The trade-off is complexity, but with proper acquire-release semantics on both max_alloc and validator, the code is correct and efficient.

TSAN Annotations Removed

The previous code had __tsan_acquire/__tsan_release annotations to suppress ThreadSanitizer warnings. These are no longer needed because the atomic operations with proper memory ordering are recognized by TSAN as correct synchronization.


@stuartcarnie stuartcarnie requested a review from a team as a code owner January 14, 2026 01:47
@clayjohn clayjohn added this to the 4.6 milestone Jan 14, 2026
Member

@Calinou Calinou left a comment


Tested locally, it works as expected. Code looks good to me.

However, I couldn't reproduce the original bug, so additional testing is needed.

Here are macOS arm64 release export template binaries for testing:

SCons options used for both binaries:

target=template_release optimize=speed disable_path_overrides=no

@stuartcarnie
Contributor Author

@Calinou those links aren't working for me

@stuartcarnie
Contributor Author

Not fixed with my changes; will revert and reevaluate.

@stuartcarnie stuartcarnie changed the title Core: Enable WEAK memory ordering macro on Apple Silicon Core: Fix RID_Alloc Lock-Free Reads on Weakly-Ordered Architectures Jan 14, 2026
Member

@bruvzg bruvzg left a comment


I can't reproduce the issue (M4 Pro, release template build), but using release-acquire ordering seems to make sense.

@blueskythlikesclouds
Contributor

Not fixed with my changes; will revert and reevaluate.

Same on my M4 Pro, I tried master and this PR, both cause the particle to stay at origin.

@Alex2782
Member

When I compile Godot Editor myself on my MacMini M1, I cannot reproduce the issues with scons production=yes with and without optimize=speed.

So far I have been able to reproduce the issue with the official dev4 and beta2 versions.
#114900 (comment)

This should affect any official version created from November 12th onwards (dev4+).
https://godotengine.org/download/archive/

@stuartcarnie
Contributor Author

Not fixed with my changes; will revert and reevaluate.

Same on my M4 Pro, I tried master and this PR, both cause the particle to stay at origin.

Did you try with commit 023fcfc?

@stuartcarnie
Contributor Author

With commit 023fcfc I am still able to reproduce with CI/CD build 🤔

@akien-mga
Member

With commit 023fcfc I am still able to reproduce with CI/CD build 🤔

Then I'm at least happy to see it's reproducible in some cases with a proper Xcode build, and isn't specific to our official builds made with osxcross. I was starting to worry we'd have to dig deep into that one again.

@stuartcarnie
Contributor Author

Looking more closely at the code, the validator access is not synchronised properly with the _allocate_rid. Mixing mutex and lock-free is challenging.

I'd say we revert #112657, as @akien-mga suggested, and revisit for 4.7. We should open a separate PR for the revert, and I'll use this one to work on a complete fix.

@bruvzg
Member

bruvzg commented Jan 14, 2026

For the reference, I can't reproduce it with artifact from this PR or RC1 template as well.

@stuartcarnie
Contributor Author

@bruvzg what version of macOS?

Updated `_allocate_rid` to use release when storing max,
ensuring earlier stores cannot be reordered **after** it.

Updated `owns` and `get_or_null` to use acquire when loading max,
ensuring later reads cannot be reordered **before** it.
@stuartcarnie
Contributor Author

We weren't using release / acquire semantics for the validator field, which I've updated. I can't reproduce with a local release build. I'll try with CI/CD when it finishes.

Also note that get_or_null is probably broken in the existing code even prior to #112657, as that change was based on the implementation of get_or_null.

@stuartcarnie
Contributor Author

@blueskythlikesclouds can you try with my latest commit?

@bruvzg
Member

bruvzg commented Jan 14, 2026

what version of macOS?

26.2 (25C56)

@stuartcarnie
Contributor Author

what version of macOS?

26.2 (25C56)

Same as me 👍🏻

@bruvzg
Member

bruvzg commented Jan 14, 2026

Additionally tested it on VM with 15.7.1 (24G231), same results, can't reproduce it with any version.

@blueskythlikesclouds
Contributor

blueskythlikesclouds commented Jan 14, 2026

@blueskythlikesclouds can you try with my latest commit?

Still happening on my end.

@akien-mga
Member

I'd say we revert #112657, as @akien-mga suggested and revisit for 4.7. We should open a separate PR and I'll use this one to working on a complete fix.

I opened #114963 to revert #112657, in case this still seems to be the best course of action.

Would be good if folks who can reproduce the bug reliably can test #114963 to confirm that it does solve the problem.

@stuartcarnie
Contributor Author

If I build a release locally with Xcode, I'm not able to reproduce the issue, but pulling down the CI build does. 🤷🏻

@akien-mga
Member

CI builds are made with Xcode 26.0.1:

# From https://github.com/actions/runner-images/blob/main/images/macos
- name: Select Xcode 26
  run: sudo xcode-select -s /Applications/Xcode_26.0.1.app

Official builds are using SDKs from Xcode 26.1.1:

https://github.com/godotengine/build-containers?tab=readme-ov-file#toolchains

Maybe it's something that got fixed in Xcode 26.2?

@stuartcarnie
Contributor Author

I'll try updating the Xcode image to see if that makes a difference.

@stuartcarnie stuartcarnie requested a review from a team as a code owner January 14, 2026 22:50
@stuartcarnie
Contributor Author

@akien-mga your suspicion was correct – Xcode 26.2 resolves the issue!

@blueskythlikesclouds, et al, can you try this binary: https://github.com/godotengine/godot/actions/runs/21012696146/artifacts/5133991403

@jeffuntildeath
Contributor

I built and tested the patch with Xcode 26.2 and can still reproduce it

@stuartcarnie
Contributor Author

I'm building off my CMake project, so now I need to figure out what is different from building with scons.

@stuartcarnie
Contributor Author

@jeffuntildeath were you reproducing with your original mrp?

@jeffuntildeath
Contributor

jeffuntildeath commented Jan 15, 2026 via email

@stuartcarnie
Contributor Author

@Alex2782
Member

Alex2782 commented Jan 15, 2026

I built and tested the patch with Xcode 26.2 and can still reproduce it

How exactly did you execute scons? @jeffuntildeath
I could only reproduce the issue with official versions (Xcode 26.1.1), not when compiling myself. (master branch)

xcodebuild -version          
Xcode 26.2
Build version 17C52

@jeffuntildeath
Contributor

@Alex2782

same:
xcodebuild -version
Xcode 26.2
Build version 17C52

executing 'scons' with no arguments

@jeffuntildeath
Contributor

@stuartcarnie Yes, it is reproducible using the build you supplied.

@RandomShaper
Member

RandomShaper commented Jan 15, 2026

Let me give my assessment here. The fact that the loads and stores to max_alloc are relaxed is very intentional. The idea is that, provided the mutex and fences are working correctly, that variable doesn't need specific synchronization. The operation uses atomics precisely for atomicity: the variable can be written and read concurrently, and we wouldn't want a half-written value in any case (provided it's aligned, probably no real-world architecture would fall into that, so the code generated is the same as for regular, non-atomic operations). That's the rationale behind this whole approach.

With that in mind, I'd say the root issue here is that the stray fences don't synchronize with the acquire-release of the mutex. I can state some different fixes for that:

  • a) Instead of using a Mutex, we'd use a SpinLock. With that, we would replace the fences with atomic acquire/release operation on its locked variable, which would obviously synchronize with the locking and unlocking operations, as they work on that very variable.
  • b) Keep the Mutex, but also issue acquire/release fences around the calls to lock and unlock it. That way, synchronization among every piece of logic involved would happen. We'd thus be using the mutex solely because of its critical section feature, with its acquire-release semantics being irrelevant. For complete correctness, this would also require making all the relevant loads and stores atomic, even if relaxed.
  • c) Use C++20 to make a Mutex implementation based on atomics and their new wait and notify_one operations. I haven't dealt with this personally, but if I understand correctly, this allows creating a mutex that you can both lock-unlock and locklessly synchronize with, with all those operations synchronizing among themselves. (For this, we may want to compile just the mutex as C++20.)

@akien-mga, would you like me to provide PRs for any of these approaches?

@akien-mga
Member

akien-mga commented Jan 15, 2026

@akien-mga your suspicion was correct – Xcode 26.2 resolves the issue!

Here's a build of 4.6-rc1 with Xcode 26.2 made with the same toolchain as the official 4.6-rc1 build using Xcode 26.1.1:

https://github.com/akien-mga/godot/releases/tag/akien%2F4.6-rc1-xcode26.2
(export templates include macOS and iOS templates)

Would be good to get this tested to see if Xcode 26.2 does improve the situation, though early reports from @jeffuntildeath using the CI builds seem to indicate that it may not.

I'm starting to wonder if there's some UB here where it might behave differently not just based on the compiler, but on ASLR / memory mapping etc.

Another question is whether this bug is reproducible on iOS, or only on macOS?


@RandomShaper I'm not qualified to assess which option makes the most sense, but given the time (we're at RC1 and we shouldn't make any risky core change at this time), I think we'll likely go ahead with reverting #112657.

For 4.7, we should retry this optimization and maybe include the approach you think is best to fix it fully. We'll have to benchmark to see whether the regression fix for the optimization doesn't undo the performance benefits.

Regarding option (c) using C++20, this is not an option right now as we're using C++17 (though I think we do promote to C++20 for Metal so if this is constrained to macOS/iOS, this could work). But if we keep these changes for 4.7 anyway, then C++20 might be a good option as we currently aim to merge #100749 for 4.7.

@RandomShaper
Member

Theoretically, this bug could bite on any multi-core architecture. In practice, however, it will only cause issues on ARM devices, regardless of whether it's iOS, macOS or Android. That said, aspects such as the number of cores may make it harder to reproduce on some.

Reverting the owns() optimization would still leave an incorrect get_or_null(), as far as I can tell.

I'm making a PR with what I deem the approach that diverges the least from the current code. It would be great if the people able to reproduce the issue could test it...

@RandomShaper
Member

@akien-mga, I think I'm stepping out of this for now, as my attempts to fix it in #114980 are not being successful. I'll take another look at this when I have more time to analyze the matter. For the time being, for the next release, if the revert way is good enough, you have my blessing.

@Alex2782
Member

Here's a build of 4.6-rc1 with Xcode 26.2 made with the same toolchain as the official 4.6-rc1 build using Xcode 26.1.1:

With this version I can also reproduce the issues on my MacMini M1.


executing 'scons' with no arguments

Yes, then I can reproduce it, but not if I compile with production=yes, strange.


@akien-mga
Member

Superseded by #114963.

@akien-mga akien-mga closed this Jan 16, 2026
@akien-mga akien-mga removed this from the 4.6 milestone Jan 16, 2026


Successfully merging this pull request may close these issues.

GPUParticles3D: emitter only draws particle at origin on macOS arm64
