- 
                Notifications
    You must be signed in to change notification settings 
- Fork 47
Integrates Hostmesh v1 and RDMA #378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Codecov Report❌ Patch coverage is  Additional details and impacted files@@           Coverage Diff           @@
##             main     #378   +/-   ##
=======================================
  Coverage        ?   64.44%           
=======================================
  Files           ?       79           
  Lines           ?     7735           
  Branches        ?        0           
=======================================
  Hits            ?     4985           
  Misses          ?     2750           
  Partials        ?        0           ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but should probably get a Monarch person's eyes on this.
| dataset: | ||
| procs: 1 | ||
| with_gpus: false | ||
| mesh_name: dataset | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have to be manually specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's probably a way to do this in a not silly way but alas
| from monarch._src.actor.v1.host_mesh import this_proc | ||
| from monarch._src.actor.v1.proc_mesh import get_or_spawn_controller | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove this in lieu of the public APIs which have moved to v1 by default.
* start adding env constants * adds env_constants * get_value on env var * makes changes in the rest of the codebase * final fix * a few more cleanups * merge fixes * test fix * initial commit for hostmesh v1 * more updates for testing * changes for running * extent * replace monarch build * rename wheel * fix * fix docs * grpo main still runs * fix docs build * rename ci wheel * fix the wheel path * wheel name * documentation * fix remote_setup for MAST
What this PR does:
Extentnor HostMesh stuff)MONARCH_HOSTMESH_V1environment variable, which defaults to False.use_dcpis now controlled by whether or not RDMA is being used.To run:
This won't work until changes in TorchStore lands. Until then, the normal path works (i.e. defaults are
TORCHSTORE_RDMA_ENABLED=0andMONARCH_HOSTMESH_V1=0)Proof of 32B running with this setup: