colin2328 (Contributor) commented Nov 4, 2025

Differential Revision: D86152013

python -m monarch.actor.torchrun train.py
NCCL version 2.27.5+cuda12.9
[Rank 0] Step 0 loss=-3.3797261714935303
[Rank 1] Step 0 loss=-2.670731544494629
[Rank 0] Step 1 loss=0.08844298124313354
[Rank 1] Step 1 loss=-1.594184398651123
[Rank 0] Step 2 loss=-2.7455830574035645
[Rank 1] Step 2 loss=-0.4763872027397156
[Rank 0] Step 3 loss=-2.7619757652282715
[Rank 1] Step 3 loss=-2.6958324909210205
[Rank 0] Step 4 loss=-1.0318412780761719
[Rank 1] Step 4 loss=-3.984520673751831

meta-cla bot added the CLA Signed label Nov 4, 2025

meta-codesync bot commented Nov 4, 2025

@colin2328 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86152013.

…oc manager (meta-pytorch#1750)

Summary:

Adds a new monarch.actor.torchrun module with a torchrun interface.
https://fb.workplace.com/groups/996292674996363/permalink/1318076819484612/

The torchx app def is in the next diff, D86155019.

Differential Revision: D86152013
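
For readers unfamiliar with what a torchrun interface implies for a script like train.py: torchrun starts one worker process per rank and passes each one its identity through environment variables such as RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. The sketch below illustrates that convention only. It is not the code from this PR, and it does not show how monarch.actor.torchrun is implemented. The names `worker_env` and `launch`, the default address and port, and the single-host assumption are all hypothetical choices made for illustration.

```python
import os
import subprocess
import sys


def worker_env(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Build the environment a torchrun-style launcher hands to one worker.

    Hypothetical helper for illustration; not part of monarch or torch.
    """
    env = dict(os.environ)
    env.update({
        "RANK": str(rank),
        # Assumption: a single host, so the local rank equals the global rank.
        "LOCAL_RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    })
    return env


def launch(script, nproc):
    """Start nproc copies of script, one per rank, and return their exit codes."""
    procs = [
        subprocess.Popen([sys.executable, script], env=worker_env(rank, nproc))
        for rank in range(nproc)
    ]
    return [p.wait() for p in procs]
```

Under these assumptions, `launch("train.py", 2)` would start two workers that can each read their rank from `os.environ["RANK"]`, which is how a script can print per-rank lines in the style of the `[Rank 0]` and `[Rank 1]` output in the description.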
Labels

CLA Signed, fb-exported, meta-exported
