Fix possible hang when using torch.multiprocessing #6271

jleibs · 2024-05-08T21:41:00Z

What

Resolves: Hang when using torch.multiprocessing after having called rr.init #6223

Although we already have a fork handler that cleans up our global and thread-local recording streams, an allocated PyRecordingStream can apparently still leak into the subprocess via fork (at least given how PyTorch multiprocessing works).

During __del__, we make one last call to a non-blocking flush.

Previously this was fine, but we have since added an internal blocking batcher flush to our non-blocking sink flush. That inner flush hangs in the forked child for the same reason as before: the batcher processing thread does not exist in the fork.
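The underlying mechanism can be seen without rerun or PyTorch. The following is a minimal sketch, assuming a POSIX system where `os.fork` is available; the helper name `count_threads_in_forked_child` and the idle worker thread are hypothetical stand-ins for the batcher thread, not part of the rerun codebase.

```python
import os
import threading


def count_threads_in_forked_child() -> tuple[int, int]:
    """Return (threads alive in the parent, threads alive in a forked child)."""
    stop = threading.Event()
    # Hypothetical stand-in for a background processing thread.
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()

    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: only the thread that called fork() is copied.
        try:
            os.close(read_fd)
            os.write(write_fd, str(threading.active_count()).encode())
        finally:
            os._exit(0)  # leave immediately, skipping normal interpreter teardown

    # Parent process: read the child's report, then clean up.
    os.close(write_fd)
    child_count = int(os.read(read_fd, 16).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return parent_count, child_count
```

The child sees only the thread that called `fork`, so anything in the child that waits on work normally done by the missing background thread will wait forever. This is the general shape of the hang described above.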

The fixes:

- Add a new unit test that repros the error.
- Check for is_forked_child in all the methods that issue inner.batcher.flush_blocking().
- Check for is_forked_child from __del__ and don't call flush, to avoid getting a gratuitous warning printout.
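The guard pattern in the list above can be sketched in plain Python. This is an illustration only: the actual change lives in rerun_py/src/python_bridge.rs, and the class `RecordingStreamSketch`, its `_creator_pid` field, and the PID comparison are assumptions made for this example, not the real implementation of is_forked_child.

```python
import os


class RecordingStreamSketch:
    """Hypothetical stand-in for a stream handle; not the rerun API."""

    def __init__(self) -> None:
        # Remember which process created this object.
        self._creator_pid = os.getpid()
        self.flush_calls = 0

    def is_forked_child(self) -> bool:
        # Assumption: a PID mismatch means we are running in a forked copy.
        return os.getpid() != self._creator_pid

    def flush(self) -> bool:
        """Flush unless running in a forked child; return whether a flush ran."""
        if self.is_forked_child():
            # The background thread that would service the flush does not
            # exist in this process, so waiting on it would hang.
            return False
        self.flush_calls += 1
        return True

    def __del__(self) -> None:
        # Skip the final flush entirely in a forked child, so that no
        # warning is printed for an object the child never really owned.
        if not self.is_forked_child():
            self.flush()
```

Checking in both places is deliberate in this sketch: the guard inside `flush` protects every caller, while the guard in `__del__` avoids even attempting the call during destruction.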

Checklist

- I have read and agree to the Contributor Guide and the Code of Conduct
- I've included a screenshot or gif (if applicable)
- I have tested the web demo (if applicable):
  - Using examples from latest main build: rerun.io/viewer
  - Using full set of examples from nightly build: rerun.io/viewer
- The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
- If applicable, add a new check to the release checklist!

To run all checks from main, comment on the PR with @rerun-bot full-check.

jleibs added 4 commits May 8, 2024 17:05

New unit test reproducing the error

ac0d717

Check for forked child in more places

bce6296

Don't call the flush if destructing in a forked process

b211c56

Make mypy happy

efd8387

jleibs added the 🐍 Python API, 🪳 bug, and include in changelog labels May 8, 2024

jleibs added this to the 0.16 milestone May 8, 2024

jleibs marked this pull request as ready for review May 8, 2024 21:41

jleibs added 2 commits May 8, 2024 17:52

Fix docstring

936f446

Comment in test

8b8e6d4

jleibs force-pushed the jleibs/fix_hang branch from 8e7c293 to 8b8e6d4 May 8, 2024 21:55

teh-cmc self-requested a review May 13, 2024 09:25

teh-cmc approved these changes May 13, 2024

Review comment on rerun_py/src/python_bridge.rs (outdated, resolved)

Update rerun_py/src/python_bridge.rs

819454c

Co-authored-by: Clement Rey <[email protected]>

jleibs merged commit af377b4 into main May 13, 2024
19 of 20 checks passed

jleibs deleted the jleibs/fix_hang branch May 13, 2024 12:36
