Fix possible hang when using torch.multiprocessing #6271

Merged: jleibs merged 7 commits into main from jleibs/fix_hang on May 13, 2024

jleibs (Member) commented May 8, 2024

What

Although we already have a fork handler that cleans up our global and thread-local recording streams, an allocated PyRecordingStream can apparently still leak into the subprocess via fork (at least given how PyTorch multiprocessing works).

During __del__, we make one last call to a non-blocking flush.

This used to be fine, but we have since added an internal blocking batcher flush to our non-blocking sink flush, so the call now hangs for the same reason as before: the batcher processing thread does not exist in the forked child.
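
The root cause is general to fork, not specific to Rerun: only the thread that calls fork is carried into the child. A minimal sketch of that behavior follows, assuming a POSIX system and CPython; the function name `worker_survives_fork` is hypothetical and nothing here uses the Rerun API. It forks while a background thread is running and asks the child whether it still sees that thread as alive.

```python
import os
import threading


def worker_survives_fork() -> bool:
    """Fork while a background thread is running and report whether the
    child process still sees that thread as alive. POSIX only."""
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() was copied over.
        try:
            os.write(write_fd, b"1" if worker.is_alive() else b"0")
        finally:
            os._exit(0)  # skip interpreter cleanup in the child

    # Parent: collect the child's answer, then shut the worker down.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return answer == b"1"


if __name__ == "__main__":
    print("worker alive in child:", worker_survives_fork())
```

Anything in the child that waits on work owned by such a thread, as a blocking flush does, has nobody left to wake it up.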

The fixes:

  • Add a new unit test that repros the error.
  • Check for is_forked_child in all the methods that issue inner.batcher.flush_blocking()
  • Check for is_forked_child from __del__ and don't call flush to avoid getting a gratuitous warning printout.
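
The guard pattern in the last two bullets can be sketched in Python. This is an illustration only: the actual change lives in the Rust bridge, and `SketchBatcher`, `SketchStream`, and the PID comparison used to detect a forked child are assumptions made for the example, not the real implementation of `is_forked_child`.

```python
import os
import queue
import threading


class SketchBatcher:
    """Hypothetical stand-in for a batcher with a background processing thread."""

    def __init__(self) -> None:
        self._creator_pid = os.getpid()  # assumption: detect forks by PID
        self._queue: "queue.Queue[object]" = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self) -> None:
        while True:
            self._queue.get()  # pretend to process the item
            self._queue.task_done()

    def push(self, item: object) -> None:
        self._queue.put(item)

    def is_forked_child(self) -> bool:
        return os.getpid() != self._creator_pid

    def flush_blocking(self) -> bool:
        # In a forked child the worker thread is gone, so join() could
        # wait forever on pending items. Bail out instead of blocking.
        if self.is_forked_child():
            return False
        self._queue.join()
        return True


class SketchStream:
    """Hypothetical stand-in for a recording stream that flushes on delete."""

    def __init__(self) -> None:
        self.batcher = SketchBatcher()

    def flush(self) -> bool:
        return self.batcher.flush_blocking()

    def __del__(self) -> None:
        # Skip the final flush entirely in a forked child, so no
        # warning is printed for a flush that cannot succeed.
        if self.batcher.is_forked_child():
            return
        self.flush()
```

Returning early from the flush avoids the hang, and checking again in `__del__` keeps the destructor silent in the child rather than attempting a flush that is known to be skipped.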

Checklist

  • I have read and agree to Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!

To run all checks from main, comment on the PR with @rerun-bot full-check.

@jleibs added the 🐍 Python API (Python logging API), 🪳 bug (Something isn't working), and include in changelog labels May 8, 2024
@jleibs added this to the 0.16 milestone May 8, 2024
@jleibs marked this pull request as ready for review May 8, 2024 21:41
@jleibs force-pushed the jleibs/fix_hang branch from 8e7c293 to 8b8e6d4 May 8, 2024 21:55
@teh-cmc self-requested a review May 13, 2024 09:25
Review thread on rerun_py/src/python_bridge.rs (outdated, resolved)
@jleibs merged commit af377b4 into main May 13, 2024
19 of 20 checks passed
@jleibs deleted the jleibs/fix_hang branch May 13, 2024 12:36
Development

Successfully merging this pull request may close these issues.

Hang when using torch.multiprocessing after having called rr.init