
[rl] add import torch to provisioner bootstrap to avoid concurrent dlopen … #3220

Merged
shuhuayu merged 1 commit into pytorch:main from shuhuayu:opt on May 5, 2026

Conversation

@shuhuayu (Contributor) commented May 5, 2026

When Monarch spawns sub-processes and multiple actor threads begin unpickling messages concurrently, they all try to import torch at the same time, causing a race condition in dlopen of torch._C.so. This results in the misleading error torch._C is not a package, even though the import works fine when done sequentially. The fix is to import torch in the Provisioner's bootstrap function, which runs once per sub-process before any threading starts, ensuring torch._C.so is fully loaded before concurrent unpickling begins. It's unclear whether this is a PyTorch bug (concurrent dlopen should be thread-safe) or a Monarch bug (imports during unpickling should be serialized), so we've added a TODO to remove the workaround once the upstream fix lands.
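For illustration, here is a minimal sketch of the workaround described above. It is not the actual Monarch code: the Provisioner bootstrap hook name (`provisioner_bootstrap`) and the threaded-unpickling setup are assumptions made only to demonstrate the idea of loading `torch._C.so` once before any concurrent unpickling starts.

```python
# Minimal sketch of the workaround; names and structure are hypothetical,
# not the real Monarch Provisioner API.
import pickle
import threading


def provisioner_bootstrap() -> None:
    """Runs once per sub-process, before any actor threads exist."""
    # Force torch._C.so to be dlopen'ed exactly once, on this thread, so
    # later concurrent unpickling never races on the shared-library load.
    # TODO: remove once the upstream fix (PyTorch or Monarch) lands.
    import torch  # noqa: F401


def actor_thread(payload: bytes) -> None:
    # Unpickling a message may lazily `import torch`; after the bootstrap
    # above this is just a sys.modules cache hit instead of a concurrent dlopen.
    pickle.loads(payload)


if __name__ == "__main__":
    provisioner_bootstrap()
    payload = pickle.dumps({"msg": "hello"})
    threads = [threading.Thread(target=actor_thread, args=(payload,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```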

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on May 5, 2026
@wwwjn (Contributor) commented May 5, 2026

@shuhuayu You can go ahead and land this fix first; the failing RL integration test is fixed in #3041.

@shuhuayu (Contributor, Author) commented May 5, 2026

> @shuhuayu You can go ahead and land this fix first; the failing RL integration test is fixed in #3041.

Sounds good. Tested it locally and it passed; I'll merge it first.

shuhuayu merged commit 2ae1340 into pytorch:main on May 5, 2026
7 of 11 checks passed

Labels: ciflow/8gpu, CLA Signed
