Creating a tracking issue, from @samlurye on Sept 2:
Over the weekend, I discovered a bug with routing messages between proc meshes. This is a minimal repro:
```python
from monarch.actor import Actor, endpoint, proc_mesh, this_host


class PassedActor(Actor):
    @endpoint
    def hello(self) -> None:
        print("hello")


class InnerActor(Actor):
    @endpoint
    async def hello(self, passed: PassedActor) -> None:
        # Fails: the message to PassedActor cannot be routed from here.
        await passed.hello.call()


class OuterActor(Actor):
    @endpoint
    async def hello(self, passed: PassedActor) -> None:
        # Spawn a new proc mesh from inside an actor and hand it the reference.
        inner_mesh = proc_mesh(gpus=1)
        await inner_mesh.spawn("inner", InnerActor).hello.call(passed)


# PassedActor and OuterActor live in separate proc meshes created by the root client.
passed_mesh = this_host().spawn_procs()
passed_actor = passed_mesh.spawn("passed", PassedActor)
outer_mesh = this_host().spawn_procs()
outer_actor = outer_mesh.spawn("outer", OuterActor)
outer_actor.hello.call_one(passed_actor).get()
```
The message from InnerActor to PassedActor is unroutable and undeliverable. I believe I have root-caused the issue to the following routing path (I'm using "proc" vs. "process" intentionally here to distinguish between hyperactor Proc and actual CPU process):
- Message from InnerActor to PassedActor gets routed to InnerActor's MeshAgent proc.
- PassedActor isn't in the MeshAgent proc's muxer, so the message is passed to its forwarder, which is a FallbackMailboxRouter.
- The fallback mailbox router first checks InnerActor's process's global_router, which doesn't contain PassedActor, so it dials out to the process where OuterActor is running.
- The message ends up at the OuterActor's router, which contains the procs in its mesh and the mesh's client proc (none of which contain PassedActor).
- The message falls back to OuterActor's process's global_router, which also doesn't route to PassedActor.
- At this point, OuterActor's process's global_router would need to forward to the root client process's global_router, but this plumbing doesn't exist, so delivery fails.
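
To make the hop sequence above easier to follow, here is a toy model of it in plain Python. It is only a sketch and does not use hyperactor's actual types: `Router`, `Undeliverable`, `build_chain`, and the router names are all hypothetical, and each router is reduced to "a set of destinations it can deliver to, plus one fallback". It also assumes that the root client process's global_router can reach PassedActor, which is what the last step implies.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


class Undeliverable(Exception):
    """Raised when no router in the fallback chain knows the destination."""


@dataclass
class Router:
    # Hypothetical stand-in for a mailbox router; not a hyperactor type.
    name: str
    # Destinations this router can deliver to directly.
    local: Set[str] = field(default_factory=set)
    # Router to try next when the destination is not local.
    fallback: Optional["Router"] = None

    def route(self, dest: str) -> List[str]:
        """Return the names of the routers visited while delivering to dest."""
        path: List[str] = []
        router: Optional[Router] = self
        while router is not None:
            path.append(router.name)
            if dest in router.local:
                return path
            router = router.fallback
        raise Undeliverable(f"{dest} unreachable via {' -> '.join(path)}")


def build_chain(link_to_root: bool) -> Router:
    """Build the chain from the issue; optionally add the missing last link."""
    # Assumption: the root client's global router can reach PassedActor.
    root_global = Router("root_client.global_router", {"PassedActor"})
    outer_global = Router(
        "outer_process.global_router",
        set(),
        root_global if link_to_root else None,  # the plumbing that is missing
    )
    outer_router = Router("outer.router", {"OuterActor", "outer_client"}, outer_global)
    inner_global = Router("inner_process.global_router", set(), outer_router)
    # Entry point: InnerActor's MeshAgent proc.
    return Router("inner.mesh_agent", {"InnerActor"}, inner_global)


if __name__ == "__main__":
    try:
        build_chain(link_to_root=False).route("PassedActor")
    except Undeliverable as err:
        print("without link:", err)
    print("with link:", build_chain(link_to_root=True).route("PassedActor"))
```

Without the link, the walk ends at the outer process's global router and raises; with it, the same walk continues one more hop to the root client's global router and succeeds.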