Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiprocessing with start-method "fork" results in hang on shutdown #1921

Closed
jleibs opened this issue Apr 19, 2023 · 2 comments · Fixed by #2676
Closed

Multiprocessing with start-method "fork" results in hang on shutdown #1921

jleibs opened this issue Apr 19, 2023 · 2 comments · Fixed by #2676
Assignees
Labels
🪳 bug Something isn't working 🐧 linux Linux-specific problems 🐍 Python API Python logging API user-request This is a pressing issue for one of our users

Comments

@jleibs
Copy link
Member

jleibs commented Apr 19, 2023

Can be reproduced with the multiprocessing demo.

Current workaround is to always force "spawn":

multiprocessing.set_start_method("spawn")

However, needing to do this will definitely bite Linux users since "fork" is the default.

See: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

@jleibs jleibs added 🪳 bug Something isn't working 🐧 linux Linux-specific problems labels Apr 19, 2023
@emilk emilk added the 🐍 Python API Python logging API label Apr 19, 2023
@teh-cmc
Copy link
Member

teh-cmc commented May 9, 2023

I can sadly confirm that even all the changes in #2061 did not fix that.

@emilk
Copy link
Member

emilk commented Jun 14, 2023

We have a user that is affected by this; they can't use spawn, so they can't update Rerun right now.

@emilk emilk added the user-request This is a pressing issue for one of our users label Jun 27, 2023
@jleibs jleibs self-assigned this Jul 11, 2023
jleibs added a commit that referenced this issue Jul 12, 2023
Closes #1921

### What

The crux of the problem is the following:
> The child process is created with a single thread—the one that called
fork(). The entire virtual address space of the parent is replicated in
the child, ...

The major consequence of this is that our global `RecordingStream`
context is duplicated into the child memory space but none of the
threads (batcher, tcp-sender, dropper, etc.) are duplicated. When we go
to call `connect()` inside the forked process, we try to replace the
global recording-stream, which subsequently tries to call drop on the
forked copy of `RecordingStreamInner` . However, without any existing
threads to process the flush, things just hang inside that flush call.

We take a few actions to alleviate this problem:
1. Introduce a new SDK function: `cleanup_if_forked` which compares the
process-ids on existing globals and forgets them as necessary.
1. In python, use `os.register_at_fork` to proactively call
`cleanup_if_forked` in any forked child processes.
1. Also add a call to `cleanup_if_forked` inside of init() in case we're
forking through a more exotic mechanism.
1. Check for the forked state anywhere we potentially flush to avoid
deadlocks and produce a visible user-error.

Additionally, it turns out that forked processes bypass the normal
python `atexit` handler which means we don't get proper shutdown/flush
behavior when the forked processes terminate. To help users workaround
this, we introduce a `@shutdown_at_exit` decorator which can be used to
decorate functions launched via multiprocessing.

### Testing

On linux:
```
$ python examples/python/multiprocessing/main.py
```
observe demo exits cleanly and all data shows in viewer.

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/2676) (if
applicable)

- [PR Build Summary](https://build.rerun.io/pr/2676)
- [Docs
preview](https://rerun.io/preview/pr%3Ajleibs%2Fcleanup_if_forked/docs)
- [Examples
preview](https://rerun.io/preview/pr%3Ajleibs%2Fcleanup_if_forked/examples)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🪳 bug Something isn't working 🐧 linux Linux-specific problems 🐍 Python API Python logging API user-request This is a pressing issue for one of our users
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants