@mariusae
Member

@mariusae mariusae commented Oct 31, 2025

Stack from ghstack (oldest at bottom):

# Basic idea: meshes are just resources like any other

The idea is that we treat meshes like any other resource. We define a common behavior for mesh controllers.

We stand up a controller actor; ActorMesh::allocate, ProcMesh::spawn, ActorMesh::spawn, etc. then do nothing more than issue a CreateOrUpdate to it. (This makes them nonblocking, which is our eventual goal, but a few more things need to land before we're fully ready for this.)

We can now query state by calling GetState on the *mesh* resource. The controller serves this call from the state it manages, so it returns immediately (nonblocking). The controller is responsible for polling/pushing/keepaliving the underlying resources.
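
To make the shape of this concrete, here is a minimal, std-only sketch of the pattern. It is not the code in this PR: `Controller`, `Status`, `observe`, and the rank-count "spec" are hypothetical stand-ins, and the real controller is an actor that receives these as messages. The point is only that CreateOrUpdate records desired state and returns, while GetState answers from the controller's cached view.

```rust
use std::collections::HashMap;

/// Hypothetical per-rank status, as cached by the controller.
#[derive(Clone, Debug, PartialEq)]
enum Status {
    Pending,
    Running,
    Stopped,
}

/// Toy controller: one cached status per rank, keyed by mesh name.
#[derive(Default)]
struct Controller {
    meshes: HashMap<String, Vec<Status>>,
}

impl Controller {
    /// Records the desired rank count and returns at once; nothing is started here.
    /// Existing ranks keep their status, new ranks start out as Pending.
    fn create_or_update(&mut self, name: &str, ranks: usize) {
        let statuses = self.meshes.entry(name.to_string()).or_default();
        statuses.resize(ranks, Status::Pending);
    }

    /// Answers from the cached view only; never contacts the ranks themselves.
    fn get_state(&self, name: &str) -> Option<Vec<Status>> {
        self.meshes.get(name).cloned()
    }

    /// Stand-in for the background polling/pushing that refreshes the cache.
    fn observe(&mut self, name: &str, rank: usize, status: Status) {
        if let Some(slot) = self.meshes.get_mut(name).and_then(|s| s.get_mut(rank)) {
            *slot = status;
        }
    }
}

fn main() {
    let mut controller = Controller::default();
    controller.create_or_update("workers", 2);
    println!("{:?}", controller.get_state("workers"));

    // Later, the controller learns what actually happened to each rank.
    controller.observe("workers", 0, Status::Running);
    controller.observe("workers", 1, Status::Stopped);
    println!("{:?}", controller.get_state("workers"));
}
```

In this sketch the caller never waits on the ranks: it only ever sees whatever the controller last observed.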

# Supervision events

The mesh controller can "raise" supervision events in the following way: when we create a resource, we include a "supervisor" port. The controller synthesizes supervision events and pushes them to this port. In effect, any time a rank in a mesh enters a non-running status, we generate a supervision event. The controller keeps track of which events it has already raised so that it does not duplicate them.
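
A rough sketch of the dedup logic, again std-only and not the PR's implementation: a `std::sync::mpsc` channel stands in for the supervisor port, and `Supervised`, `SupervisionEvent`, and `Status` are hypothetical names. It also assumes at most one event per rank, which is a simplification of "which events have been raised".

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Sender};

/// Hypothetical rank status; anything other than Running is "non-running".
#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Running,
    Failed,
    Stopped,
}

/// Hypothetical event pushed to the supervisor.
#[derive(Debug, PartialEq)]
struct SupervisionEvent {
    rank: usize,
    status: Status,
}

/// Controller-side bookkeeping for one supervised mesh.
struct Supervised {
    /// Stand-in for the "supervisor" port supplied at creation time.
    supervisor: Sender<SupervisionEvent>,
    /// Ranks for which an event has already been raised (assumed: one per rank).
    raised: HashSet<usize>,
}

impl Supervised {
    fn new(supervisor: Sender<SupervisionEvent>) -> Self {
        Supervised { supervisor, raised: HashSet::new() }
    }

    /// Called whenever the controller learns a rank's status.
    /// Returns true if a new supervision event was raised.
    fn observe(&mut self, rank: usize, status: Status) -> bool {
        if status == Status::Running {
            return false;
        }
        // `insert` returns false if the rank was already recorded.
        if !self.raised.insert(rank) {
            return false;
        }
        // A send error only means the supervisor is gone; ignore it here.
        let _ = self.supervisor.send(SupervisionEvent { rank, status });
        true
    }
}

fn main() {
    let (tx, rx) = channel();
    let mut mesh = Supervised::new(tx);
    mesh.observe(0, Status::Running);
    mesh.observe(1, Status::Failed);
    mesh.observe(1, Status::Failed); // already raised, so not pushed again
    mesh.observe(2, Status::Stopped);
    drop(mesh); // drops the sender so the loop below terminates
    for event in rx {
        println!("{:?}", event);
    }
}
```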

# Rationale

Having a common resource model is very powerful. For one, it establishes a clean separation of concerns (the controller responsible for managing a mesh vs. the owner of the mesh), and it lets us build tooling for all mesh types around a single interface (the mesh resource behavior). It also means that 1) we can have common tooling for *all* resources (a mesh is now named just by its controller actor+Name, and we can query it like any other resource, e.g., with `hyper`); and 2) it integrates cleanly with observability / "Monarchy".

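Differential Revision: [D85982104](https://our.internmc.facebook.com/intern/diff/D85982104/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D85982104/)!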

mariusae added a commit that referenced this pull request Oct 31, 2025
ghstack-source-id: 320165967
Pull Request resolved: #1729
@meta-cla meta-cla bot added the CLA Signed label Oct 31, 2025
mariusae added a commit that referenced this pull request Nov 6, 2025
Pull Request resolved: #1729
ghstack-source-id: 321276801
@exported-using-ghexport

Labels

CLA Signed, fb-exported, meta-exported

2 participants