Actors created on nested proc meshes cannot communicate with each other #1172

@allenwang28

Creating a tracking issue, from @samlurye on Sept 2:

Over the weekend, I discovered a bug with routing messages between proc meshes. This is a minimal repro:

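# Imports assumed from Monarch's Python API; they were not part of the original snippet.
from monarch.actor import Actor, endpoint, proc_mesh, this_host


class PassedActor(Actor):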
    @endpoint
    def hello(self) -> None:
        print("hello")


class InnerActor(Actor):
    @endpoint
    async def hello(self, passed: PassedActor) -> None:
        await passed.hello.call()


class OuterActor(Actor):
    @endpoint
    async def hello(self, passed: PassedActor) -> None:
        inner_mesh = proc_mesh(gpus=1)
        await inner_mesh.spawn("inner", InnerActor).hello.call(passed)


passed_mesh = this_host().spawn_procs()
passed_actor = passed_mesh.spawn("passed", PassedActor)
outer_mesh = this_host().spawn_procs()
outer_actor = outer_mesh.spawn("outer", OuterActor)
outer_actor.hello.call_one(passed_actor).get()

The message from InnerActor to PassedActor is unroutable and undeliverable. I believe I have root-caused the issue to the following routing path (I'm using "proc" vs. "process" intentionally here to distinguish between hyperactor Proc and actual CPU process):

  1. Message from InnerActor to PassedActor gets routed to InnerActor's MeshAgent proc.
  2. PassedActor isn't in the MeshAgent proc's muxer, so the message is passed to its forwarder, which is a FallbackMailboxRouter.
  3. The fallback mailbox router first checks InnerActor's process's global_router, which doesn't contain PassedActor, so it dials out to the process where OuterActor is running.
  4. The message ends up at the OuterActor's router, which contains the procs in its mesh and the mesh's client proc (none of which contain PassedActor).
  5. The message falls back to OuterActor's process's global_router, which also doesn't route to PassedActor.
  6. At this point, OuterActor's process's global_router would need to forward to the root client process's global_router, but this plumbing doesn't exist, so delivery fails.
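
To make the fallback chain above easier to reason about, here is a toy model of it in plain Python. It does not use hyperactor or Monarch APIs: `ToyRouter`, `build_chain`, and the router names are hypothetical stand-ins for the muxer, FallbackMailboxRouter, and global_router hops described in the steps. It also assumes that the root client process's global_router can reach PassedActor, which the issue implies but does not state outright.

```python
from dataclasses import dataclass, field
from typing import Optional


class Undeliverable(Exception):
    """Raised when no router on the fallback chain knows the destination."""


@dataclass
class ToyRouter:
    # Hypothetical stand-in for a router hop; not a hyperactor type.
    name: str
    known: set = field(default_factory=set)  # destinations this hop can deliver to
    fallback: Optional["ToyRouter"] = None   # next hop when the destination is unknown

    def route(self, dest: str, path: Optional[list] = None) -> list:
        """Return the list of hops visited until `dest` is found."""
        path = (path or []) + [self.name]
        if dest in self.known:
            return path
        if self.fallback is None:
            raise Undeliverable(f"{dest} unreachable via {' -> '.join(path)}")
        return self.fallback.route(dest, path)


def build_chain(link_outer_to_root: bool) -> ToyRouter:
    """Build the hop chain from the steps above, starting at InnerActor's side.

    `link_outer_to_root` toggles the missing plumbing from step 6.
    """
    # Assumption: the root client process can route to PassedActor.
    root_global = ToyRouter("root_client.global_router", {"PassedActor", "OuterActor"})
    outer_global = ToyRouter(
        "outer_process.global_router",
        {"OuterActor"},
        fallback=root_global if link_outer_to_root else None,
    )
    outer_router = ToyRouter("outer.mesh_router", {"InnerActor"}, fallback=outer_global)
    inner_global = ToyRouter("inner_process.global_router", set(), fallback=outer_router)
    return ToyRouter("inner.mesh_agent_muxer", {"InnerActor"}, fallback=inner_global)


if __name__ == "__main__":
    try:
        build_chain(link_outer_to_root=False).route("PassedActor")
    except Undeliverable as exc:
        print("without the step 6 hop:", exc)
    print("with the step 6 hop:", build_chain(link_outer_to_root=True).route("PassedActor"))
```

In this model the only difference between the failing and succeeding cases is whether the outer process's global router has a fallback to the root client's, which mirrors the gap identified in step 6.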
