-
Notifications
You must be signed in to change notification settings - Fork 40
fix - Metric logging work with new monarch API #451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix - Metric logging work with new monarch API #451
Conversation
…-metric-logging-reland
404ee1d to
00ccf0c
Compare
|
can ignore the unit test failing. CI is using old monarch version. Error is harmless too. Tests pass locally. |
| ).as_service() | ||
| generator = await GeneratorActor.options( | ||
| **service_config, mesh_name="GeneratorActor" | ||
| ).as_service() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of mesh_name here? Is it for the better display on wandb? Why is it not applied to all main files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ) | ||
| await global_logger.register_fetcher.call_one(local_fetcher_actor, proc) | ||
| # Generate a unique ID to map procmesh to fetcher | ||
| proc._uid = str(uuid.uuid4()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mind approving it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little concerning about the broken CI. Wouldn't it cause all the subsequent commits to break as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think that errors are related to this PR. But let me confirm by opening a dummy PR
| # Deregister local logger from global logger | ||
| if hasattr(proc_mesh, "_local_fetcher"): | ||
| # Deregister LocalFetcherActor from GlobalLoggingActor | ||
| if hasattr(proc_mesh, "_local_fetcher") and hasattr(proc_mesh, "_uid"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for a proc_mesh that has _local_fetcher but not _uid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, they should always have both. I guess i was having extra safe here. Is it confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you write it like this to be safe. But I just worry it may hide some potential errors. How about raise an error if it has _local_fetcher but not _uid?
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Felipe Mello <[email protected]>
We landed #351, which adds better rank names to logging (later reverted in #429). This adds 351 back, but with fixes for monarch.
Goal of the original PR:

Monarch changed APIs, breaking the code in two ways:
1)
Situation: To get the
actor_nameandProcMeshuid, i was usingmonarch.actor.context().Problem: Now actor_name returns
{actor_name}_{ProcMeshuid}(orclientif outside a PrcoMesh), and calling context.world_name errors if called outside of aProcMeshcontext.Fix: Never call
.world_nameand get the uid from actor_name, if any.Situation: To allow
GlobalLoggingActorto access theLocalFetcherActoron eachProcMesh, we would register the (key,value) as{ProcMesh: LocalFetcherActor}Problem: The ProcMesh is not hashable anymore.
Solution: Instead of using ProcMesh as a key, we create a 'str' UUID for the proc.
Extra: updated docstrings to make clear whats going on