Add monarch.torchrun module; replicate torchrun using Monarch as a process manager #1750
Differential Revision: D86152013
python -m monarch.actor.torchrun train.py
NCCL version 2.27.5+cuda12.9
[Rank 0] Step 0 loss=-3.3797261714935303
[Rank 1] Step 0 loss=-2.670731544494629
[Rank 0] Step 1 loss=0.08844298124313354
[Rank 1] Step 1 loss=-1.594184398651123
[Rank 0] Step 2 loss=-2.7455830574035645
[Rank 1] Step 2 loss=-0.4763872027397156
[Rank 0] Step 3 loss=-2.7619757652282715
[Rank 1] Step 3 loss=-2.6958324909210205
[Rank 0] Step 4 loss=-1.0318412780761719
[Rank 1] Step 4 loss=-3.984520673751831
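
The `train.py` used above is not shown in this PR. As a rough sketch of what such a script needs from its launcher: torchrun passes each worker its identity through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`). The snippet below assumes, without confirmation from this PR, that the Monarch-based launcher sets the same variables. `DistEnv`, `read_dist_env`, and `format_step` are hypothetical names used only for illustration, and the fallback values are illustrative defaults for running outside any launcher.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    """Per-worker settings that a torchrun-style launcher exports."""
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    # Variable names follow torchrun's convention. Assumption: the
    # Monarch-based launcher exports the same names.
    # The fallbacks let the script run as a single process for debugging.
    return DistEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
        master_addr=environ.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(environ.get("MASTER_PORT", "29500")),
    )


def format_step(rank: int, step: int, loss: float) -> str:
    # Mirrors the log line format shown in the output above.
    return f"[Rank {rank}] Step {step} loss={loss}"


if __name__ == "__main__":
    # Print the settings this process received from its launcher.
    print(read_dist_env())
```

A real training script would pass these values on to `torch.distributed.init_process_group` (or rely on its default `env://` initialization, which reads the same variables) before entering the training loop.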