Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some race conditions that happen during initialization #73793

Merged
merged 1 commit into from
Mar 6, 2023
Merged

Fix some race conditions that happen during initialization #73793

merged 1 commit into from
Mar 6, 2023

Conversation

myaaaaaaaaa
Copy link
Contributor

Needed for #73777

@myaaaaaaaaa myaaaaaaaaa marked this pull request as ready for review February 23, 2023 00:22
@myaaaaaaaaa myaaaaaaaaa requested a review from a team as a code owner February 23, 2023 00:22
@myaaaaaaaaa myaaaaaaaaa changed the title Fix data races in startup/teardown Fix some race conditions that happen during initialization Feb 23, 2023
@akien-mga akien-mga added this to the 4.1 milestone Feb 23, 2023
@@ -50,8 +50,8 @@ void Thread::_set_platform_functions(const PlatformFunctions &p_functions) {
platform_functions = p_functions;
}

void Thread::callback(Thread *p_self, const Settings &p_settings, Callback p_callback, void *p_userdata) {
Thread::caller_id = _thread_id_hash(p_self->thread.get_id());
void Thread::callback(ID p_caller_id, const Settings &p_settings, Callback p_callback, void *p_userdata) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about why this is needed. I mean, if the pointer was only needed for getting the id, this new version seems to be safer. However, in practice Thread object shouldn't be destroyed before the thread ends.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a call to thread.swap(new_thread) right after new_thread gets created, so in the old version, it's ambiguous whether p_self->thread refers to the old thread or the just created new_thread.

Threading is hard 🥲

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides swap, Thread technically can be destroyed (and will detach thread in this case) before the thread is finished. So it is safer.

@@ -75,7 +75,7 @@ struct _IP_ResolverPrivate {
Semaphore sem;

Thread thread;
bool thread_abort = false;
SafeFlag thread_abort;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable is synced with the semaphore. However, I see now that a data race is still possible, if bool doesn't provide atomicity guarantees across platforms. Therefore, thinking about this (and similar cases), my abandoned RelaxedFlag project is relevant (#63189). That one asks the implementation for atomicity, but without acquire-release semantics, which the semaphore is already providing.

Thoughts?

P.S.: What was the sanitizer complaint, by the way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a standard read/write complaint, it seems like all variable accesses (even to scalars like int and bool) are considered data races unless they're explicitly specified as atomic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Providing it is an IP hostname resolver, it's not performance critical and any overhead for SafeFlag going to be negligible compared to networking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this along the lines of having an idiom for this kind of cases; namely, every time there's a semaphore-paced loop with an exit flag, just use the atomic-but-relaxed version of the flag, without having to pay further attention to the performance considerations.

That said, that can be discussed separately. These changes already have their merit as they are now.

Copy link
Member

@bruvzg bruvzg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlikely any of these will cause a real issue, but changes seems good.

@@ -75,7 +75,7 @@ struct _IP_ResolverPrivate {
Semaphore sem;

Thread thread;
bool thread_abort = false;
SafeFlag thread_abort;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Providing it is an IP hostname resolver, it's not performance critical and any overhead for SafeFlag going to be negligible compared to networking.

@@ -50,8 +50,8 @@ void Thread::_set_platform_functions(const PlatformFunctions &p_functions) {
platform_functions = p_functions;
}

void Thread::callback(Thread *p_self, const Settings &p_settings, Callback p_callback, void *p_userdata) {
Thread::caller_id = _thread_id_hash(p_self->thread.get_id());
void Thread::callback(ID p_caller_id, const Settings &p_settings, Callback p_callback, void *p_userdata) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides swap, Thread technically can be destroyed (and will detach thread in this case) before the thread is finished. So it is safer.

@myaaaaaaaaa
Copy link
Contributor Author

This should be the last one involving initialization. Should I open a new PR or add it to this one?

diff --git a/core/os/thread.cpp b/core/os/thread.cpp
index 92865576f3..d533b1c1e2 100644
--- a/core/os/thread.cpp
+++ b/core/os/thread.cpp
@@ -58,13 +58,19 @@ void Thread::callback(ID p_caller_id, const Settings &p_settings, Callback p_cal
 	if (platform_functions.init) {
 		platform_functions.init();
 	}
-	ScriptServer::thread_enter(); // Scripts may need to attach a stack.
+
+	// Currently causes a race condition due to the fact that threads are
+	// spawned before ScriptServer is even initialized.
+	// This is currently unused since neither GDScript nor C# require
+	// attaching a stack to new threads, so we just disable for now.
+
+	//ScriptServer::thread_enter();
 	if (platform_functions.wrapper) {
 		platform_functions.wrapper(p_callback, p_userdata);
 	} else {
 		p_callback(p_userdata);
 	}
-	ScriptServer::thread_exit();
+	//ScriptServer::thread_exit();
 	if (platform_functions.term) {
 		platform_functions.term();
 	}

All of the possible workarounds for this one get ugly really fast since ScriptServer isn't always initialized, and any changes to main() that can fix this should probably be centralized in #49362 given the high likelihood of breaking something.

In the meantime, disabling thread_enter/exit() means that ThreadSanitizer can go into CI sooner, which would also aid debugging in case a new language that does need it is ever added.

@RandomShaper
Copy link
Member

If no scripting language is using thread enter/exit callbacks (and there are no near future plans for it), I'd even remove them.

However, I'm a bit unsure about the needs that language extensions developed separately may have (JS, Nim, Rust, whatever).

I am wondering if maybe we could somehow apply a compromise solution that patches the current situation to keep it functional and non-racey, even if somewhat hacky. I'm considering something like this:

  • Add a Mutex/SpinLock to ScriptServer so that _language_count and _languages are treated atomically.
  • On thread_enter(), take a snapshot of the language array and count in TLS.
  • On thread_exit() notify only the languages which are in both the current languages array and the TLS copy.
  • Additional thoughts:
    • Comparing current vs. snapshot languages by pointer would not be reliable; a possibility to workaround that would be that unregistered languages just leave a null entry in the array so they can be compared by index, or that the ScriptLanguage classes get some sort of serial number that lets differentiate among instances.
    • Maybe Vector(), with its COW semantics, can somehow make the new code easier?

@myaaaaaaaaa
Copy link
Contributor Author

myaaaaaaaaa commented Feb 24, 2023

  • Comparing current vs. snapshot languages by pointer would not be reliable; a possibility to workaround that would be that unregistered languages just leave a null entry in the array so they can be compared by index, or that the ScriptLanguage classes get some sort of serial number that lets differentiate among instances.
  • Maybe Vector(), with its COW semantics, can somehow make the new code easier?

Changing _languages into an array of Ref<ScriptLanguage>s and making a TLS copy in thread_enter() would probably be the cleanest way to follow through with this approach.

Unfortunately, it wouldn't help the main victim of this race condition, WorkerThreadPool. Since WorkerThreadPool threads are initialized early and live for the entire duration of the program, unlucky initializations could result in all uses of WorkerThreadPool not having a stack.

I don't know if there's any decision that can be taken here that doesn't leave something broken, including just leaving it as-is 🤷

core/os/thread.cpp Outdated Show resolved Hide resolved
@myaaaaaaaaa
Copy link
Contributor Author

The ScriptServer workaround has been split off into #74501

@akien-mga akien-mga merged commit e80ab42 into godotengine:master Mar 6, 2023
@akien-mga
Copy link
Member

Thanks!

@myaaaaaaaaa myaaaaaaaaa deleted the init-race branch March 6, 2023 20:11
@YuriSizov
Copy link
Contributor

Cherry-picked for 4.0.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants