
Clear global THREAD_CACHE in child after fork() #2764

Open
tmaxwell-anthropic opened this issue Aug 19, 2023 · 7 comments · May be fixed by #3200

Comments

@tmaxwell-anthropic

Consider the following bug:

  1. trio.to_thread.run_sync() is called, creating a worker thread. The worker thread is left in the global THREAD_CACHE.
  2. The Python process forks for some reason (perhaps via the multiprocessing module)
  3. The child process now calls trio.to_thread.run_sync(). The global THREAD_CACHE still contains a reference to the worker thread, so the child process thinks it has an idle worker thread, and tries to dispatch a task to it. However, the worker thread doesn't actually exist in the child process. So trio.to_thread.run_sync() hangs forever.

Because THREAD_CACHE is interpreter-global, this can happen even if the two Trio run loops are completely separate. For example, in a test suite, one test might call trio.to_thread.run_sync(), and then later a completely separate test might use multiprocessing to spawn a process that calls trio.to_thread.run_sync().

I think it should be fairly simple to fix this by using os.register_at_fork() to ensure THREAD_CACHE is cleared in the child whenever the interpreter forks.
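The proposed fix can be sketched with the standard-library hook alone. This is a minimal, hypothetical illustration: the THREAD_CACHE here is a stand-in dict, not Trio's actual cache object.

```python
import os

# Hypothetical stand-in for Trio's interpreter-global worker-thread cache.
# (The real object lives in Trio's internals; a plain dict is used here
# purely to demonstrate the os.register_at_fork() mechanism.)
THREAD_CACHE = {"worker-1": object()}

def _clear_cache_in_child() -> None:
    # The cached worker threads do not exist in a forked child, so any
    # cached entries are stale there and must be dropped.
    THREAD_CACHE.clear()

# after_in_child callbacks run in the child process after every fork().
os.register_at_fork(after_in_child=_clear_cache_in_child)

pid = os.fork()
if pid == 0:
    # Child: the hook has already run, so the cache is empty and the
    # child would spawn a fresh worker thread instead of hanging.
    assert THREAD_CACHE == {}
    os._exit(0)
else:
    os.waitpid(pid, 0)
    # Parent: its cache is untouched.
    assert "worker-1" in THREAD_CACHE
```

The hook fires on every fork(), including forks performed by multiprocessing's "fork" start method, which covers the test-suite scenario above.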

@TeamSpen210
Contributor

There's some previous discussion regarding handling fork() in #1614, though a bunch of the concerns there probably aren't relevant anymore...

@tmaxwell-anthropic
Author

tmaxwell-anthropic commented Aug 19, 2023

Note this can happen even if the fork isn't inside trio.run(). For example:

async def foo():
  ... do some async stuff ...
  await trio.to_thread.run_sync(...)
  ... do some async stuff ...

trio.run(foo)
# we are now outside trio.run(), but the worker thread is still in THREAD_CACHE

os.fork()

# this is fine on the parent, but fails on the child, due to the bug
trio.run(foo)

My understanding is that it's not really practical to support fork() inside trio.run(), but it should generally be fine to have fork() and trio.run() in the same program at different times -- except for this bug :)

@A5rocks
Contributor

A5rocks commented Oct 4, 2023

Do you want to contribute a fix? Otherwise I might try to fix this (though I'm not very comfortable with things related to threads...)

@jakkdl
Member

jakkdl commented Jan 31, 2025

> Do you want to contribute a fix? Otherwise I might try to fix this (though I'm not very comfortable with things related to threads...)

I don't think @tmaxwell-anthropic is going to, do you want to do it? Otherwise I'll give it a try (and I'm certainly not comfortable with thread things either :P)

@jakkdl
Member

jakkdl commented Feb 4, 2025

I can repro THREAD_CACHE not being cleared, but I'm failing to repro any hang:

This, based on #2764 (comment), runs without issue for me (Arch Linux) on py39-py313:

import os
import trio

def blah() -> None:
    pass

async def foo() -> None:
    await trio.to_thread.run_sync(blah)

trio.run(foo)
# we are now outside trio.run(), but the worker thread is still in THREAD_CACHE

os.fork()

# this is fine on the parent, but fails on the child, due to the bug
trio.run(foo)

although I get a DeprecationWarning about mixing threads and fork(), see https://docs.python.org/3/library/os.html#os.fork and https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555
That thread mentions implementation differences, which might be why my repro doesn't hang, but it also talks a lot about how just calling os.register_at_fork() is insufficient to resolve the issue.

Idk how much of that is relevant to Trio, but I would regardless love a concrete repro from @tmaxwell-anthropic. It's possible my repro above would fail if I pushed it to CI on multiple platforms, but haven't tried that.

@tmaxwell-anthropic
Author

When I run that script, the child process silently hangs, but the parent process appears to exit successfully. When you say you "failed to repro the hang", is it possible that's what you saw?

Here's a modified script with some additional logging:

import os

import trio

print(f"The parent process is: {os.getpid()=}")


def blah() -> None:
    print(f"blah() is executing in {os.getpid()=}")


async def foo() -> None:
    print(f"foo() starting in {os.getpid()=}")
    await trio.to_thread.run_sync(blah)
    print(f"foo() ending in {os.getpid()=}")


trio.run(foo)
# we are now outside trio.run(), but the worker thread is still in THREAD_CACHE

child_pid = os.fork()
if child_pid == 0:
    print(f"We are the child, {os.getpid()=}")
else:
    print(f"We are the parent, {os.getpid()=} and {child_pid=}")

# this is fine on the parent, but fails on the child, due to the bug
trio.run(foo)

if child_pid != 0:
    print("Parent waiting for child to exit...")
    os.wait4(child_pid, 0)

When I run this on either linux or macos, this prints something like:

The parent process is: os.getpid()=471978
foo() starting in os.getpid()=471978
blah() is executing in os.getpid()=471978
foo() ending in os.getpid()=471978
We are the parent, os.getpid()=471978 and child_pid=471981
We are the child, os.getpid()=471981
foo() starting in os.getpid()=471978
foo() starting in os.getpid()=471981
blah() is executing in os.getpid()=471978
foo() ending in os.getpid()=471978
Parent waiting for child to exit...

Note that the child prints "foo() starting in os.getpid()=471981", but never prints "blah() is executing in os.getpid()=471981" or "foo() ending in os.getpid()=471981"; the child is hanging in the call to trio.to_thread.run_sync().

Can you repro the problem with that script?
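The "silent hang" failure mode above is easy to miss from the parent's side. A small watchdog sketch (hypothetical, not from the thread) shows how a test harness could detect a child that never exits instead of blocking forever; the sleeping child here simulates the stuck run_sync() call.

```python
import os
import signal
import time

# Fork a child that hangs, simulating the stuck trio.to_thread.run_sync().
pid = os.fork()
if pid == 0:
    time.sleep(60)  # stand-in for the hung call
    os._exit(0)

# Parent: poll non-blockingly for the child instead of os.wait4(pid, 0),
# which would block forever if the child is hung.
deadline = time.monotonic() + 2.0
child_status = None
while time.monotonic() < deadline:
    done_pid, status = os.waitpid(pid, os.WNOHANG)
    if done_pid == pid:
        child_status = status
        break
    time.sleep(0.1)

if child_status is None:
    # The child never exited within the timeout: this is the hang
    # described in the issue. Clean it up rather than leaking it.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    print("child hung and was killed")
```

Replacing the unconditional os.wait4() in the reproducer with a pattern like this makes the child's hang visible as a timeout rather than an apparent clean exit of the parent.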

@A5rocks
Contributor

A5rocks commented Feb 5, 2025

I can certainly reproduce with your script. I was reading the Python documentation for os.fork and nowadays there's this really funny warning:

> If Python is able to detect that your process has multiple threads, os.fork() now raises a DeprecationWarning.
>
> We chose to surface this as a warning, when detectable, to better inform developers of a design problem that the POSIX platform specifically notes as not supported. Even in code that appears to work, it has never been safe to mix threading with os.fork() on POSIX platforms. The CPython runtime itself has always made API calls that are not safe for use in the child process when threads existed in the parent (such as malloc and free).
>
> Users of macOS or users of libc or malloc implementations other than those typically found in glibc to date are among those already more likely to experience deadlocks running such code.
>
> See this discussion on fork being incompatible with threads for technical details of why we're surfacing this longstanding platform compatibility problem to developers.

This actually appears; running the latest reproducer on Python 3.12 (minus the wait at the end):

$ python -X dev x.py
The parent process is: os.getpid()=678993
foo() starting in os.getpid()=678993
blah() is executing in os.getpid()=678993
foo() ending in os.getpid()=678993
/tmp/tmp.2Xe4eWoYAq/x.py:21: DeprecationWarning: This process (pid=678993) is multi-threaded, use of fork() may lead to deadlocks in the child.
  child_pid = os.fork()
We are the parent, os.getpid()=678993 and child_pid=679002
We are the child, os.getpid()=679002
foo() starting in os.getpid()=678993
foo() starting in os.getpid()=679002
blah() is executing in os.getpid()=678993
foo() ending in os.getpid()=678993

However, Trio should still do the right thing here and maintain the invariant that the thread cache is per-process, as the documentation above its definition implies.

@A5rocks A5rocks linked a pull request Feb 5, 2025 that will close this issue