Core: Fix RID_Alloc Lock-Free Reads on Weakly-Ordered Architectures #114937
stuartcarnie wants to merge 2 commits into godotengine:master
Conversation
Calinou
left a comment
Tested locally, it works as expected. Code looks good to me.
However, I couldn't reproduce the original bug, so additional testing is needed.
Here are macOS arm64 release export template binaries for testing:
- master: 114937_master.zip
- This PR: 114937_pr.zip
SCons options used for both binaries:
target=template_release optimize=speed disable_path_overrides=no
|
@Calinou those links aren't working for me |
a261c27 to
8f0f066
Compare
|
Not fixed with my changes; will revert and reevaluate. |
8f0f066 to
023fcfc
Compare
bruvzg
left a comment
I can't reproduce the issue (M4 Pro, release template build), but using release-acquire ordering seems to make sense.
Same on my M4 Pro, I tried master and this PR, both cause the particle to stay at origin. |
|
When I compile Godot Editor myself on my MacMini M1, I cannot reproduce the issues with So far I have been able to reproduce the issue with the official This should affect any official version created from November 12th onwards (dev4+). |
Did you try with commit 023fcfc? |
|
With commit 023fcfc I am still able to reproduce with CI/CD build 🤔 |
Then at least I'm happy to see it's reproducible at least in some cases with a proper Xcode build, and isn't specific to our official builds made with osxcross. I was starting to worry we'd have to dig deep into that one again. |
|
Looking more closely at the code, the validator access is not synchronised properly with the I'd say we revert #112657, as @akien-mga suggested, and revisit for 4.7. We should open a separate PR and I'll use this one to work on a complete fix. |
|
For the reference, I can't reproduce it with artifact from this PR or RC1 template as well. |
|
@bruvzg what version of macOS? |
Updated `_allocate_rid` to use release when storing max, ensuring earlier stores cannot be reordered **after** it. Updated `owns` and `get_or_null` to use acquire when loading max, ensuring later reads cannot be reordered **before** it.
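The pairing described in this commit message can be sketched as follows. This is only a minimal illustration under simplifying assumptions, not the Godot implementation: `ChunkTable`, `grow`, `read`, and the fixed-capacity layout are hypothetical names invented for the example, and the writer is assumed to be serialized by a mutex elsewhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for a growable slot table.
struct ChunkTable {
	static constexpr uint32_t CAPACITY = 1024;
	int slots[CAPACITY] = {};
	std::atomic<uint32_t> max_alloc{ 0 };

	// Writer (assumed to hold a mutex): initialize new slots, then publish the new bound.
	void grow(uint32_t p_count, int p_value) {
		uint32_t old_max = max_alloc.load(std::memory_order_relaxed);
		uint32_t new_max = old_max + p_count;
		assert(new_max <= CAPACITY);
		for (uint32_t i = old_max; i < new_max; i++) {
			slots[i] = p_value;
		}
		// Release: the slot writes above cannot be reordered after this store.
		max_alloc.store(new_max, std::memory_order_release);
	}

	// Lock-free reader.
	bool read(uint32_t p_index, int &r_value) const {
		// Acquire: the slot read below cannot be reordered before this load.
		if (p_index >= max_alloc.load(std::memory_order_acquire)) {
			return false;
		}
		r_value = slots[p_index];
		return true;
	}
};
```

A reader that observes the new `max_alloc` value through the acquire load is then guaranteed to also observe the slot initialization that preceded the release store.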
023fcfc to
91c9e27
Compare
|
We weren't using release / acquire semantics for the Also note that |
|
@blueskythlikesclouds can you try with my latest commit? |
26.2 (25C56) |
Same as me 👍🏻 |
|
Additionally tested it on VM with 15.7.1 (24G231), same results, can't reproduce it with any version. |
Still happening on my end. |
I opened #114963 to revert #112657, in case this still seems to be the best course of action. Would be good if folks who can reproduce the bug reliably can test #114963 to confirm that it does solve the problem. |
|
If I build a release locally with Xcode, I'm not able to reproduce the issue, but pulling down the CI build does. 🤷🏻 |
|
CI builds are made with Xcode 26.0.1: godot/.github/workflows/macos_builds.yml Lines 40 to 42 in 481f36e Official builds are using SDKs from Xcode 26.1.1: https://github.com/godotengine/build-containers?tab=readme-ov-file#toolchains Maybe it's something that got fixed in Xcode 26.2? |
|
I'll try updating the Xcode image to see if that makes a difference. |
|
@akien-mga your suspicion was correct – Xcode 26.2 resolves the issue! @blueskythlikesclouds, et al, can you try this binary: https://github.com/godotengine/godot/actions/runs/21012696146/artifacts/5133991403 |
|
I built and tested the patch with Xcode 26.2 and can still reproduce it |
|
I'm building off my CMake project; so now I need to figure out what is different than building with |
|
@jeffuntildeath were you reproducing with your original mrp? |
|
Yes, same project.
|
@jeffuntildeath would you mind trying this build: https://github.com/godotengine/godot/actions/runs/21012696146/artifacts/5133991403 |
How exactly did you execute scons? @jeffuntildeath |
|
same: executing 'scons' with no arguments |
|
@stuartcarnie Yes, it is reproducible using the build you supplied. |
|
Let me give my assessment here. The fact that the loads and stores to With that in mind, I'd say the root issue here is that the stray fences don't synchronize with the acquire-release of the mutex. I can state some different fixes for that:
@akien-mga, would you like me to provide PRs for any of these approaches? |
Here's a build of 4.6-rc1 with Xcode 26.2 made with the same toolchain as the official 4.6-rc1 build using Xcode 26.1.1: https://github.com/akien-mga/godot/releases/tag/akien%2F4.6-rc1-xcode26.2 Would be good to get this tested to see if Xcode 26.2 does improve the situation, though early reports from @jeffuntildeath using the CI builds seem to suggest that it may not. I'm starting to wonder if there's some UB here where it might behave differently not just based on the compiler, but on ASLR / memory mapping etc. Another question is whether this bug is reproducible on iOS, or only on macOS? @RandomShaper I'm not qualified to assess which option makes the most sense, but given the timing (we're at RC1 and we shouldn't make any risky core change at this time), I think we'll likely go ahead with reverting #112657. For 4.7, we should retry this optimization and maybe include the approach you think is best to fix it fully. We'll have to benchmark to see whether the regression fix undoes the performance benefits of the optimization. Regarding option (c) using C++20, this is not an option right now as we're using C++17 (though I think we do promote to C++20 for Metal so if this is constrained to macOS/iOS, this could work). But if we keep these changes for 4.7 anyway, then C++20 might be a good option as we currently aim to merge #100749 for 4.7. |
|
Theoretically, this bug could bite on any multi-core architecture. However, in practice, it will only cause issues on ARM devices, regardless of whether they run iOS, macOS or Android. However, it may be the case that aspects such as the number of cores make it harder to reproduce on some. Reverting the I'm making a PR with what I deem the approach that diverges from the current code the least. It would be great if the people able to reproduce the issue could test it... |
|
@akien-mga, I think I'm stepping out of this for now, as my attempts to fix it in #114980 have not been successful. I'll take another look at this when I have more time to analyze the matter. For the time being, for the next release, if the revert is good enough, you have my blessing. |
|
Superseded by #114963. |


Fixes #114900
This PR fixes a memory ordering bug in `RID_Alloc` that caused data races on weakly-ordered architectures (ARM, Apple Silicon, POWER, RISC-V). The fix adds proper `memory_order_release`/`memory_order_acquire` semantics to both `max_alloc` and `validator` atomic operations.
Important
🤖 AI disclosure
I used AI to help identify and fix the issue. I also used AI to help describe the problem in detail, which I reviewed and edited. I have reviewed the code, understand it and tested a release build with the original MRP, Bistro Demo, and can confirm it works with commit 91c9e27.
The Bug
The `RID_Alloc` class uses a lock-free read path in `get_or_null()` and `owns()` for performance. These functions read shared state without holding the mutex, but the original code used `memory_order_relaxed`, which provides no ordering guarantees.
Original Code (Broken)
The Race Conditions
Race 1: Chunk initialization (when growing)
- `max_alloc` with relaxed ordering
- `max_alloc` with relaxed ordering, sees new value
Race 2: Validator (every allocation)
- `validator` (non-atomic), unlocks mutex
- `max_alloc`, accesses `validator`
The second race is the critical one: `max_alloc` only changes when chunks grow (rare), but `validator` changes on every allocation.
Why `WEAK_MEMORY_ORDER 1` Didn't Fix It
The codebase had a `WEAK_MEMORY_ORDER` macro that used `atomic_thread_fence`:
This failed for two reasons:
Problem 1: Fence Position
Acquire fences must come after the load, not before.
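To illustrate the rule, here is a minimal sketch of fence-based message passing with the fences in the positions the C++ memory model requires. The names `payload`, `ready`, `publish`, and `try_consume` are hypothetical; this is not the `WEAK_MEMORY_ORDER` code itself, only an example of the general pattern.

```cpp
#include <atomic>
#include <cassert>

int payload = 0; // Plain (non-atomic) data being handed over.
std::atomic<bool> ready{ false };

void publish(int p_value) {
	payload = p_value;
	// The release fence goes BEFORE the relaxed store that publishes the flag.
	std::atomic_thread_fence(std::memory_order_release);
	ready.store(true, std::memory_order_relaxed);
}

bool try_consume(int &r_value) {
	if (!ready.load(std::memory_order_relaxed)) {
		return false;
	}
	// The acquire fence goes AFTER the relaxed load that observed the flag.
	// Placed before the load, it would not order the payload read below.
	std::atomic_thread_fence(std::memory_order_acquire);
	r_value = payload;
	return true;
}
```

With this placement, the release fence synchronizes with the acquire fence whenever the reader's relaxed load sees the value written by the writer's relaxed store.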
Problem 2: No Synchronization on Validator
Even with correct `max_alloc` synchronization, the `validator` field was accessed non-atomically. The `max_alloc` release/acquire only synchronizes chunk structure, not per-slot validator writes.
The Fix
Change 1: Release Store for Validator in `_allocate_rid()`
What `memory_order_release` guarantees:
A release store acts as a one-way barrier that prevents preceding operations from being reordered after it:
This ensures that when another thread observes the validator, all initialization that preceded the store is guaranteed visible.
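The writer side of this guarantee might look roughly like the sketch below. It assumes a hypothetical `Slot` layout and an `allocate_slot` helper invented for illustration; the real `RID_Alloc` chunk layout and `_allocate_rid()` body differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical slot layout, for illustration only.
struct Slot {
	int data = 0;
	std::atomic<uint32_t> validator{ 0 }; // 0 means "not allocated" in this sketch.
};

// Writer side, assumed to run with the allocator mutex held.
void allocate_slot(Slot &r_slot, int p_data, uint32_t p_validator) {
	r_slot.data = p_data; // Plain write: initialize the slot first.
	// Release store: the write above cannot be reordered after it, so the
	// validator only becomes visible once the slot contents are in place.
	r_slot.validator.store(p_validator, std::memory_order_release);
}
```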
Change 2: Acquire Load for Validator in `get_or_null()` and `owns()`
What `memory_order_acquire` guarantees:
An acquire load acts as a one-way barrier that prevents subsequent operations from being reordered before it:
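A reader in the spirit of `get_or_null()` could be sketched as below. `Slot` and `read_slot` are hypothetical names for this example, and the validator check is simplified compared to the real code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical slot layout, for illustration only.
struct Slot {
	int data = 0;
	std::atomic<uint32_t> validator{ 0 };
};

// Lock-free reader: copies the data out only if the validator matches.
bool read_slot(const Slot &p_slot, uint32_t p_expected_validator, int &r_data) {
	// Acquire load: the read of `data` below cannot be reordered before it.
	uint32_t v = p_slot.validator.load(std::memory_order_acquire);
	if (v != p_expected_validator) {
		return false;
	}
	r_data = p_slot.data;
	return true;
}
```

The acquire load only provides a guarantee when it pairs with a release store of the same atomic on the writer side.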
Change 3: Release Store in `get_or_null()` When Clearing Init Bit
Change 4: Retain `max_alloc` Release/Acquire
The `max_alloc` synchronization is still needed for chunk structure visibility:
Complete Synchronization Model
Both synchronization points are necessary:
- `max_alloc`: ensures chunk structure is visible (when chunks grow)
- `validator`: ensures per-slot allocation is visible (every allocation)
How This Works on Different Architectures
Strongly-Ordered (x86/x64)
x86 has Total Store Order (TSO), where stores aren't reordered with stores and loads aren't reordered with loads. The original `relaxed` operations happened to work at the hardware level, but the compiler could still reorder. With `release`/`acquire`, the compiler is constrained and hardware behavior is unchanged.
Weakly-Ordered (ARM/Apple Silicon)
ARM permits almost any reordering unless explicitly prevented:
- `STLR` instruction: all prior writes complete first
- `LDAR` instruction: all subsequent reads wait
The C++ memory model abstracts this, making the code portable across all architectures.
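As a rough illustration of that portability, the same two C++ operations are typically lowered to different instructions per target. The sketch below uses hypothetical names (`flag`, `store_release`, `load_acquire`); the instruction notes in the comments describe common compiler output and may vary with compiler, flags, and CPU features.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<uint32_t> flag{ 0 };

void store_release(uint32_t p_value) {
	// AArch64: typically a single STLR. x86-64: typically a plain MOV.
	flag.store(p_value, std::memory_order_release);
}

uint32_t load_acquire() {
	// AArch64: typically LDAR (or LDAPR on cores with RCpc). x86-64: typically a plain MOV.
	return flag.load(std::memory_order_acquire);
}
```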
Why Lock-Free Instead of Mutex?
The lock-free read path provides:
- `owns()` validation
The trade-off is complexity, but with proper acquire-release semantics on both `max_alloc` and `validator`, the code is correct and efficient.
TSAN Annotations Removed
The previous code had `__tsan_acquire`/`__tsan_release` annotations to suppress ThreadSanitizer warnings. These are no longer needed because the atomic operations with proper memory ordering are recognized by TSAN as correct synchronization.