
Clear global THREAD_CACHE in child after fork() #2764

Open
tmaxwell-anthropic opened this issue Aug 19, 2023 · 7 comments · May be fixed by #3200

Comments

@tmaxwell-anthropic

Consider the following bug:

  1. trio.to_thread.run_sync() is called, creating a worker thread. The worker thread is left in the global THREAD_CACHE.
  2. The Python process forks for some reason (perhaps via the multiprocessing module)
  3. The child process now calls trio.to_thread.run_sync(). The global THREAD_CACHE still contains a reference to the worker thread, so the child process thinks it has an idle worker thread, and tries to dispatch a task to it. However, the worker thread doesn't actually exist in the child process. So trio.to_thread.run_sync() hangs forever.

Because THREAD_CACHE is interpreter-global, this can happen even if the two Trio run loops are completely separate. For example, in a test suite, one test might call trio.to_thread.run_sync(), and then later a completely separate test might use multiprocessing to spawn a process that calls trio.to_thread.run_sync().

I think it should be fairly simple to fix this by using os.register_at_fork() to ensure THREAD_CACHE is cleared in the child whenever the interpreter forks.
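The proposed fix can be sketched with the standard-library hook alone. This is a minimal, hypothetical illustration: the THREAD_CACHE here is a stand-in dict, not Trio's actual cache object.

```python
import os

# Hypothetical stand-in for Trio's interpreter-global worker-thread cache.
# (The real object lives in Trio's internals; a plain dict is used here
# purely to demonstrate the os.register_at_fork() mechanism.)
THREAD_CACHE = {"worker-1": object()}

def _clear_cache_in_child() -> None:
    # The cached worker threads do not exist in a forked child, so any
    # cached entries are stale there and must be dropped.
    THREAD_CACHE.clear()

# after_in_child callbacks run in the child process after every fork().
os.register_at_fork(after_in_child=_clear_cache_in_child)

pid = os.fork()
if pid == 0:
    # Child: the hook has already run, so the cache is empty and the
    # child would spawn a fresh worker thread instead of hanging.
    assert THREAD_CACHE == {}
    os._exit(0)
else:
    os.waitpid(pid, 0)
    # Parent: its cache is untouched.
    assert "worker-1" in THREAD_CACHE
```

The hook fires on every fork(), including forks performed by multiprocessing's "fork" start method, which covers the test-suite scenario above.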

@TeamSpen210
Contributor

There's some previous discussion regarding handling fork() in #1614, though a bunch of the concerns there probably aren't relevant anymore...

@tmaxwell-anthropic
Author

tmaxwell-anthropic commented Aug 19, 2023

Note this can happen even if the fork isn't inside trio.run(). For example:

async def foo():
  ... do some async stuff ...
  await trio.to_thread.run_sync(...)
  ... do some async stuff ...

trio.run(foo)
# we are now outside trio.run(), but the worker thread is still in THREAD_CACHE

os.fork()

# this is fine on the parent, but fails on the child, due to the bug
trio.run(foo)

My understanding is that it's not really practical to support fork() inside trio.run(), but it should generally be fine to have fork() and trio.run() in the same program at different times -- except for this bug :)

@A5rocks
Contributor

A5rocks commented Oct 4, 2023

Do you want to contribute a fix? Otherwise I might try to fix this (though I'm not very comfortable with things related to threads...)

@jakkdl
Member

jakkdl commented Jan 31, 2025

> Do you want to contribute a fix? Otherwise I might try to fix this (though I'm not very comfortable with things related to threads...)

I don't think @tmaxwell-anthropic is going to, do you want to do it? Otherwise I'll give it a try (and I'm certainly not comfortable with thread things either :P)

@jakkdl
Member

jakkdl commented Feb 4, 2025

I can repro THREAD_CACHE not being cleared, but I'm failing to repro any hang:

This, based on #2764 (comment), runs without issue for me (Arch Linux) on py39-py313:

import os
import trio

def blah() -> None:
    pass

async def foo() -> None:
    await trio.to_thread.run_sync(blah)

trio.run(foo)
# we are now outside trio.run(), but the worker thread is still in THREAD_CACHE

os.fork()

# this is fine on the parent, but fails on the child, due to the bug
trio.run(foo)

although I get a DeprecationWarning about mixing threads and fork(), see https://docs.python.org/3/library/os.html#os.fork and https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555
That thread mentions implementation differences, which might be why my repro doesn't hang, but it also talks a lot about how just calling os.register_at_fork() is insufficient to resolve the issue.

Idk how much of that is relevant to Trio, but I would regardless love a concrete repro from @tmaxwell-anthropic. It's possible my repro above would fail if I pushed it to CI on multiple platforms, but haven't tried that.

@tmaxwell-anthropic
Author

When I run that script, the child process silently hangs, but the parent process appears to exit successfully. When you say you "failed to repro the hang", is it possible that's what you saw?

Here's a modified script with some additional logging:

import os

import trio

print(f"The parent process is: {os.getpid()=}")


def blah() -> None:
    print(f"blah() is executing in {os.getpid()=}")


async def foo() -> None:
    print(f"foo() starting in {os.getpid()=}")
    await trio.to_thread.run_sync(blah)
    print(f"foo() ending in {os.getpid()=}")


trio.run(foo)
# we are now outside trio.run(), but the worker thread is still in THREAD_CACHE

child_pid = os.fork()
if child_pid == 0:
    print(f"We are the child, {os.getpid()=}")
else:
    print(f"We are the parent, {os.getpid()=} and {child_pid=}")

# this is fine on the parent, but fails on the child, due to the bug
trio.run(foo)

if child_pid != 0:
    print("Parent waiting for child to exit...")
    os.wait4(child_pid, 0)

When I run this on either linux or macos, this prints something like:

The parent process is: os.getpid()=471978
foo() starting in os.getpid()=471978
blah() is executing in os.getpid()=471978
foo() ending in os.getpid()=471978
We are the parent, os.getpid()=471978 and child_pid=471981
We are the child, os.getpid()=471981
foo() starting in os.getpid()=471978
foo() starting in os.getpid()=471981
blah() is executing in os.getpid()=471978
foo() ending in os.getpid()=471978
Parent waiting for child to exit...

Note that the child prints "foo() starting in os.getpid()=471981", but never prints "blah() is executing in os.getpid()=471981" or "foo() ending in os.getpid()=471981"; the child is hanging in the call to trio.to_thread.run_sync().

Can you repro the problem with that script?
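The "silent hang" failure mode above is easy to miss from the parent's side. A small watchdog sketch (hypothetical, not from the thread) shows how a test harness could detect a child that never exits instead of blocking forever; the sleeping child here simulates the stuck run_sync() call.

```python
import os
import signal
import time

# Fork a child that hangs, simulating the stuck trio.to_thread.run_sync().
pid = os.fork()
if pid == 0:
    time.sleep(60)  # stand-in for the hung call
    os._exit(0)

# Parent: poll non-blockingly for the child instead of os.wait4(pid, 0),
# which would block forever if the child is hung.
deadline = time.monotonic() + 2.0
child_status = None
while time.monotonic() < deadline:
    done_pid, status = os.waitpid(pid, os.WNOHANG)
    if done_pid == pid:
        child_status = status
        break
    time.sleep(0.1)

if child_status is None:
    # The child never exited within the timeout: this is the hang
    # described in the issue. Clean it up rather than leaking it.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    print("child hung and was killed")
```

Replacing the unconditional os.wait4() in the reproducer with a pattern like this makes the child's hang visible as a timeout rather than an apparent clean exit of the parent.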

@A5rocks
Contributor

A5rocks commented Feb 5, 2025

I can certainly reproduce with your script. I was reading the Python documentation for os.fork and nowadays there's this really funny warning:

> If Python is able to detect that your process has multiple threads, os.fork() now raises a DeprecationWarning.
>
> We chose to surface this as a warning, when detectable, to better inform developers of a design problem that the POSIX platform specifically notes as not supported. Even in code that appears to work, it has never been safe to mix threading with os.fork() on POSIX platforms. The CPython runtime itself has always made API calls that are not safe for use in the child process when threads existed in the parent (such as malloc and free).
>
> Users of macOS or users of libc or malloc implementations other than those typically found in glibc to date are among those already more likely to experience deadlocks running such code.
>
> See this discussion on fork being incompatible with threads for technical details of why we're surfacing this longstanding platform compatibility problem to developers.

This actually appears; running the latest reproducer on Python 3.12 (minus the wait at the end):

$ python -X dev x.py
The parent process is: os.getpid()=678993
foo() starting in os.getpid()=678993
blah() is executing in os.getpid()=678993
foo() ending in os.getpid()=678993
/tmp/tmp.2Xe4eWoYAq/x.py:21: DeprecationWarning: This process (pid=678993) is multi-threaded, use of fork() may lead to deadlocks in the child.
  child_pid = os.fork()
We are the parent, os.getpid()=678993 and child_pid=679002
We are the child, os.getpid()=679002
foo() starting in os.getpid()=678993
foo() starting in os.getpid()=679002
blah() is executing in os.getpid()=678993
foo() ending in os.getpid()=678993

However, Trio should still do the right thing here and maintain the invariant that the thread cache is per-process, as the documentation above its definition implies.

@A5rocks A5rocks linked a pull request Feb 5, 2025 that will close this issue