
Core: Fix RID_Alloc Lock-Free Reads on Weakly-Ordered Architectures #114937

Closed
stuartcarnie wants to merge 2 commits into godotengine:master from stuartcarnie:apple_arm_lock_free

Conversation


@stuartcarnie stuartcarnie commented Jan 14, 2026

Fixes #114900

This PR fixes a memory ordering bug in RID_Alloc that caused data races on weakly-ordered architectures (ARM, Apple Silicon, POWER, RISC-V). The fix adds proper memory_order_release/memory_order_acquire semantics to both max_alloc and validator atomic operations.

Important

🤖 AI disclosure

I used AI to help identify and fix the issue. I also used AI to help describe the problem in detail, which I reviewed and edited. I have reviewed the code, understand it and tested a release build with the original MRP, Bistro Demo, and can confirm it works with commit 91c9e27.

The Bug

The RID_Alloc class uses a lock-free read path in get_or_null() and owns() for performance. These functions read shared state without holding the mutex, but the original code used memory_order_relaxed which provides no ordering guarantees.

Original Code (Broken)

// In _allocate_rid() - writer side
chunks[chunk_count][i].validator = 0xFFFFFFFF;     // 1. Initialize chunk (rare, when growing)
((std::atomic<uint32_t> *)&max_alloc)->store(...,
    std::memory_order_relaxed);                     // 2. Publish chunk (NO ordering)
chunks[idx][elem].validator = validator | 0x80000000; // 3. Set validator (EVERY allocation)
mutex.unlock();

// In get_or_null() / owns() - reader side
SYNC_ACQUIRE;  // ← Fence BEFORE load (wrong position!)
ma = ((std::atomic<uint32_t> *)&max_alloc)->load(
    std::memory_order_relaxed);                     // ← NO ordering guarantee
// ... access validator (non-atomic read!)

The Race Conditions

Race 1: Chunk initialization (when growing)

  1. Thread A initializes chunk data, stores max_alloc with relaxed ordering
  2. Thread B loads max_alloc with relaxed ordering, sees new value
  3. Thread B accesses chunk but sees stale/uninitialized data

Race 2: Validator (every allocation)

  1. Thread A writes validator (non-atomic), unlocks mutex
  2. Thread B loads max_alloc, accesses validator
  3. Thread B sees stale validator because there's no synchronization

The second race is the critical one—max_alloc only changes when chunks grow (rare), but validator changes on every allocation.

Why WEAK_MEMORY_ORDER 1 Didn't Fix It

The codebase had a WEAK_MEMORY_ORDER macro that used atomic_thread_fence:

#define SYNC_ACQUIRE std::atomic_thread_fence(std::memory_order_acquire);

This failed for two reasons:

Problem 1: Fence Position

SYNC_ACQUIRE;  // Fence here...
ma = load(max_alloc, relaxed);  // ...then load (WRONG!)

Acquire fences must come after the load, not before.

Problem 2: No Synchronization on Validator

Even with correct max_alloc synchronization, the validator field was accessed non-atomically. The max_alloc release/acquire only synchronizes chunk structure, not per-slot validator writes.

The Fix

Change 1: Release Store for Validator in _allocate_rid()

// Before: two non-atomic writes
chunks[free_chunk][free_element].validator = validator;
chunks[free_chunk][free_element].validator |= 0x80000000;

// After: single atomic release store
((std::atomic<uint32_t> *)&chunks[free_chunk][free_element].validator)
    ->store(validator | 0x80000000, std::memory_order_release);

What memory_order_release guarantees:

A release store acts as a one-way barrier that prevents preceding operations from being reordered after it:

writes to chunks[n]           ─┐
writes to free_list           │  Cannot move below the release store
                              ─┘
         ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
   store(validator, release)  [RELEASE BARRIER]
         ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲

This ensures that when another thread observes the validator, all initialization that preceded the store is guaranteed visible.

Change 2: Acquire Load for Validator in get_or_null() and owns()

// Before: non-atomic read with misplaced fence
SYNC_ACQUIRE;
// ... later ...
if (c.validator != validator) { ... }

// After: atomic acquire load
uint32_t cur_validator;
if constexpr (THREAD_SAFE) {
    cur_validator = ((std::atomic<uint32_t> *)&chunks[idx_chunk][idx_element].validator)
        ->load(std::memory_order_acquire);
} else {
    cur_validator = chunks[idx_chunk][idx_element].validator;
}

What memory_order_acquire guarantees:

An acquire load acts as a one-way barrier that prevents subsequent operations from being reordered before it:

         ▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
   load(validator, acquire)   [ACQUIRE BARRIER]
         ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲
                              ─┐
read chunk data                │  Cannot move above the acquire load
                              ─┘

Change 3: Release Store in get_or_null() When Clearing Init Bit

if constexpr (THREAD_SAFE) {
    ((std::atomic<uint32_t> *)&chunks[idx_chunk][idx_element].validator)
        ->store(cur_validator & 0x7FFFFFFF, std::memory_order_release);
} else {
    chunks[idx_chunk][idx_element].validator = cur_validator & 0x7FFFFFFF;
}

Change 4: Retain max_alloc Release/Acquire

The max_alloc synchronization is still needed for chunk structure visibility:

// _allocate_rid() - when growing chunks
((std::atomic<uint32_t> *)&max_alloc)->store(max_alloc + elements_in_chunk, std::memory_order_release);

// get_or_null() / owns()
ma = ((std::atomic<uint32_t> *)&max_alloc)->load(std::memory_order_acquire);

Complete Synchronization Model

Thread A (_allocate_rid):              Thread B (get_or_null/owns):
────────────────────────              ────────────────────────────
[if growing chunks:]
  init chunk data
  store(max_alloc, release)  ────────► load(max_alloc, acquire)
                                              │
store(validator, release)    ────────────────►│
                                              ▼
                                       load(validator, acquire)
                                              │
                                              ▼
                                       access data ✓

Both synchronization points are necessary:

  • max_alloc: ensures chunk structure is visible (when chunks grow)
  • validator: ensures per-slot allocation is visible (every allocation)

How This Works on Different Architectures

Strongly-Ordered (x86/x64)

x86 has Total Store Order (TSO): stores aren't reordered with other stores, loads aren't reordered with other loads, and only a store followed by a later load may be reordered. The original relaxed operations happened to work at the hardware level, but the compiler was still free to reorder them. With release/acquire, the compiler is constrained, while the generated x86 instructions are unchanged.

Weakly-Ordered (ARM/Apple Silicon)

ARM permits almost any reordering unless explicitly prevented:

  • Release store compiles to STLR instruction—all prior writes complete first
  • Acquire load compiles to LDAR instruction—all subsequent reads wait

The C++ memory model abstracts this, making the code portable across all architectures.

Why Lock-Free Instead of Mutex?

The lock-free read path provides:

  1. No contention: Multiple readers proceed simultaneously
  2. No syscalls: Mutex operations may involve kernel transitions
  3. Better scalability: Critical for frequently-called owns() validation

The trade-off is complexity, but with proper acquire-release semantics on both max_alloc and validator, the code is correct and efficient.

TSAN Annotations Removed

The previous code had __tsan_acquire/__tsan_release annotations to suppress ThreadSanitizer warnings. These are no longer needed because the atomic operations with proper memory ordering are recognized by TSAN as correct synchronization.


@stuartcarnie stuartcarnie requested a review from a team as a code owner January 14, 2026 01:47
@clayjohn clayjohn added this to the 4.6 milestone Jan 14, 2026
Member

@Calinou Calinou left a comment


Tested locally, it works as expected. Code looks good to me.

However, I couldn't reproduce the original bug, so additional testing is needed.

Here are macOS arm64 release export template binaries for testing:

SCons options used for both binaries:

target=template_release optimize=speed disable_path_overrides=no

@stuartcarnie
Contributor Author

@Calinou those links aren't working for me

@stuartcarnie
Contributor Author

Not fixed with my changes; will revert and reevaluate.

@stuartcarnie stuartcarnie changed the title Core: Enable WEAK memory ordering macro on Apple Silicon Core: Fix RID_Alloc Lock-Free Reads on Weakly-Ordered Architectures Jan 14, 2026
Member

@bruvzg bruvzg left a comment


I can't reproduce the issue (M4 Pro, release template build), but using release-acquire ordering seems to make sense.

@blueskythlikesclouds
Contributor

Not fixed with my changes; will revert and reevaluate.

Same on my M4 Pro, I tried master and this PR, both cause the particle to stay at origin.

@Alex2782
Member

When I compile Godot Editor myself on my MacMini M1, I cannot reproduce the issues with scons production=yes with and without optimize=speed.

So far I have been able to reproduce the issue with the official dev4 and beta2 versions.
#114900 (comment)

This should affect any official version created from November 12th onwards (dev4+).
https://godotengine.org/download/archive/

@stuartcarnie
Contributor Author

Not fixed with my changes; will revert and reevaluate.

Same on my M4 Pro, I tried master and this PR, both cause the particle to stay at origin.

Did you try with commit 023fcfc?

@stuartcarnie
Contributor Author

With commit 023fcfc I am still able to reproduce with CI/CD build 🤔

@akien-mga
Member

With commit 023fcfc I am still able to reproduce with CI/CD build 🤔

Then I'm at least happy to see it's reproducible in some cases with a proper Xcode build, and isn't specific to our official builds made with osxcross. I was starting to worry we'd have to dig deep into that one again.

@stuartcarnie
Contributor Author

Looking more closely at the code, the validator access is not synchronised properly with the _allocate_rid. Mixing mutex and lock-free is challenging.

I'd say we revert #112657, as @akien-mga suggested, and revisit for 4.7. We should open a separate PR for the revert, and I'll use this one to work on a complete fix.

@bruvzg
Member

bruvzg commented Jan 14, 2026

For the reference, I can't reproduce it with artifact from this PR or RC1 template as well.

@stuartcarnie
Contributor Author

@bruvzg what version of macOS?

Updated `_allocate_rid` to use release when storing max,
ensuring earlier stores cannot be reordered **after** it.

Updated `owns` and `get_or_null` to use acquire when loading max,
ensuring later reads cannot be reordered **before** it.
@stuartcarnie
Contributor Author

We weren't using release / acquire semantics for the validator field, which I've updated. I can't reproduce with a local release build. I'll try with CI/CD when it finishes.

Also note that get_or_null is probably broken in the existing code even prior to #112657, as that change was based on the implementation of get_or_null.

@stuartcarnie
Contributor Author

@blueskythlikesclouds can you try with my latest commit?

@bruvzg
Member

bruvzg commented Jan 14, 2026

what version of macOS?

26.2 (25C56)

@stuartcarnie
Contributor Author

what version of macOS?

26.2 (25C56)

Same as me 👍🏻

@bruvzg
Member

bruvzg commented Jan 14, 2026

Additionally tested it on VM with 15.7.1 (24G231), same results, can't reproduce it with any version.

@blueskythlikesclouds
Contributor

blueskythlikesclouds commented Jan 14, 2026

@blueskythlikesclouds can you try with my latest commit?

Still happening on my end.

@akien-mga
Member

I'd say we revert #112657, as @akien-mga suggested and revisit for 4.7. We should open a separate PR and I'll use this one to working on a complete fix.

I opened #114963 to revert #112657, in case this still seems to be the best course of action.

Would be good if folks who can reproduce the bug reliably can test #114963 to confirm that it does solve the problem.

@stuartcarnie
Contributor Author

If I build a release locally with Xcode, I'm not able to reproduce the issue, but pulling down the CI build does. 🤷🏻

@akien-mga
Member

CI builds are made with Xcode 26.0.1:

# From https://github.com/actions/runner-images/blob/main/images/macos
- name: Select Xcode 26
  run: sudo xcode-select -s /Applications/Xcode_26.0.1.app

Official builds are using SDKs from Xcode 26.1.1:

https://github.com/godotengine/build-containers?tab=readme-ov-file#toolchains

Maybe it's something that got fixed in Xcode 26.2?

@stuartcarnie
Contributor Author

I'll try updating the Xcode image to see if that makes a difference.

@stuartcarnie stuartcarnie requested a review from a team as a code owner January 14, 2026 22:50
@stuartcarnie
Contributor Author

@akien-mga your suspicion was correct – Xcode 26.2 resolves the issue!

@blueskythlikesclouds, et al, can you try this binary: https://github.com/godotengine/godot/actions/runs/21012696146/artifacts/5133991403

@jeffuntildeath
Contributor

I built and tested the patch with Xcode 26.2 and can still reproduce it

@stuartcarnie
Contributor Author

I'm building off my CMake project, so now I need to figure out what is different from building with scons.

@stuartcarnie
Contributor Author

@jeffuntildeath were you reproducing with your original mrp?

@jeffuntildeath
Contributor

jeffuntildeath commented Jan 15, 2026 via email

@stuartcarnie
Contributor Author

@Alex2782
Member

Alex2782 commented Jan 15, 2026

I built and tested the patch with Xcode 26.2 and can still reproduce it

How exactly did you execute scons? @jeffuntildeath
I could only reproduce the issue with official versions (Xcode 26.1.1), not when compiling myself. (master branch)

xcodebuild -version          
Xcode 26.2
Build version 17C52

@jeffuntildeath
Contributor

@Alex2782

same:
xcodebuild -version
Xcode 26.2
Build version 17C52

executing 'scons' with no arguments

@jeffuntildeath
Contributor

@stuartcarnie Yes, it is reproducible using the build you supplied.

@RandomShaper
Member

RandomShaper commented Jan 15, 2026

Let me give my assessment here. The fact that the loads and stores to max_alloc are relaxed is very intentional. The idea is that, provided the mutex and fences are working correctly, that variable doesn't need specific synchronization. The operation uses atomics precisely for atomicity: the variable can be written and read concurrently, and we wouldn't want a half-written value in any case (provided it's aligned, probably no real-world architecture would fall into that, so the code generated is the same as for regular, non-atomic operations). That's the rationale behind this whole approach.

With that in mind, I'd say the root issue here is that the stray fences don't synchronize with the acquire-release of the mutex. I can state some different fixes for that:

  • a) Instead of using a Mutex, we'd use a SpinLock. With that, we would replace the fences with atomic acquire/release operation on its locked variable, which would obviously synchronize with the locking and unlocking operations, as they work on that very variable.
  • b) Keep the Mutex, but also issue acquire/release fences around the calls to lock and unlock it. That way, synchronization among every piece of logic involved would happen. We'd thus be using the mutex solely because of its critical section feature, with its acquire-release semantics being irrelevant. For complete correctness, this would also require making all the relevant loads and stores atomic, even if relaxed.
  • c) Use C++20 to make a Mutex implementation based on atomics and their new wait and notify_one operations. I haven't dealt with this personally, but if I understand correctly, this allows creating a mutex that you can both lock-unlock and locklessly synchronize with, with all those operations synchronizing among themselves. (For this, we may want to compile just the mutex as C++20.)

@akien-mga, would you like me to provide PRs for any of these approaches?

@akien-mga
Member

akien-mga commented Jan 15, 2026

@akien-mga your suspicion was correct – Xcode 26.2 resolves the issue!

Here's a build of 4.6-rc1 with Xcode 26.2 made with the same toolchain as the official 4.6-rc1 build using Xcode 26.1.1:

https://github.com/akien-mga/godot/releases/tag/akien%2F4.6-rc1-xcode26.2
(export templates include macOS and iOS templates)

Would be good to get this tested to see if Xcode 26.2 does improve the situation, though early reports from @jeffuntildeath using the CI builds seem to indicate that it may not.

I'm starting to wonder if there's some UB here where it might behave differently not just based on the compiler, but on ASLR / memory mapping etc.

Another question is whether this bug is reproducible on iOS, or only on macOS?


@RandomShaper I'm not qualified to assess which option makes the most sense, but given the time (we're at RC1 and we shouldn't make any risky core change at this time), I think we'll likely go ahead with reverting #112657.

For 4.7, we should retry this optimization and maybe include the approach you think is best to fix it fully. We'll have to benchmark to see whether the regression fix for the optimization doesn't undo the performance benefits.

Regarding option (c) using C++20, this is not an option right now as we're using C++17 (though I think we do promote to C++20 for Metal so if this is constrained to macOS/iOS, this could work). But if we keep these changes for 4.7 anyway, then C++20 might be a good option as we currently aim to merge #100749 for 4.7.

@RandomShaper
Member

Theoretically, this bug could bite on any multi-core architecture. In practice, however, it will only cause issues on ARM devices, regardless of whether it's iOS, macOS or Android. That said, aspects such as the number of cores may make it harder to reproduce on some.

Reverting the owns() optimization would still leave an incorrect get_or_null(), as far as I can tell.

I'm making a PR with what I deem the approach that diverges the least from the current code. It would be great if the people able to reproduce the issue could test it...

@RandomShaper
Member

@akien-mga, I think I'm stepping out of this for now, as my attempts to fix it in #114980 are not being successful. I'll take another look at this when I have more time to analyze the matter. For the time being, for the next release, if the revert way is good enough, you have my blessing.

@Alex2782
Member

Here's a build of 4.6-rc1 with Xcode 26.2 made with the same toolchain as the official 4.6-rc1 build using Xcode 26.1.1:

With this version I can also reproduce the issues on my MacMini M1.


executing 'scons' with no arguments

Yes, then I can reproduce it, but not if I compile with production=yes, strange.


@akien-mga
Member

Superseded by #114963.

@akien-mga akien-mga closed this Jan 16, 2026
@akien-mga akien-mga removed this from the 4.6 milestone Jan 16, 2026


Successfully merging this pull request may close these issues.

GPUParticles3D: emitter only draws particle at origin on macOS arm64
