Metric Logging updates 4/N - better actor name #351

felipemello1 · 2025-10-08T19:16:44Z

When logging per rank, we need good names to make debugging easier. provisioner.py::get_proc_mesh now provides a mesh_name (#378), so I am just using it.

Users can also pass a process name as input. That's how we get the Controller name.

mlogger = await get_or_create_metric_logger(process_name="Controller")

The process_name then flows through: LocalFetcherActor -> MetricCollector -> backend.init -> wandb.init(name)
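To make the hand-off concrete, here is a minimal sketch of that chain. The three classes below are simplified stand-ins written for illustration, not the actual forge implementations: `RecordingBackend` and both `init_backends` methods are hypothetical, and the backend only records the name where a real one would presumably call `wandb.init(name=...)`.

```python
from typing import Optional


class RecordingBackend:
    """Hypothetical backend stand-in: records the run name instead of calling wandb."""

    def __init__(self) -> None:
        self.run_name: Optional[str] = None

    def init(self, process_name: str) -> None:
        # A real W&B backend would pass this on, e.g. wandb.init(name=process_name).
        self.run_name = process_name


class MetricCollector:
    """Simplified collector: owns the backend and initializes it with the name."""

    def __init__(self, backend: RecordingBackend) -> None:
        self.backend = backend

    def init_backends(self, process_name: str) -> None:
        self.backend.init(process_name)


class LocalFetcherActor:
    """Simplified per-process actor: holds the name it was spawned with."""

    def __init__(self, process_name: str, collector: MetricCollector) -> None:
        self.process_name = process_name
        self.collector = collector

    def init_backends(self) -> None:
        # The name is forwarded unchanged down to the backend.
        self.collector.init_backends(self.process_name)


backend = RecordingBackend()
fetcher = LocalFetcherActor("Controller", MetricCollector(backend))
fetcher.init_backends()
print(backend.run_name)
```

The point of the sketch is only that the name is set once at the top and forwarded untouched, so the run name in the backend always matches the process that created it.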


allenwang28 · 2025-10-08T22:35:46Z

could you add some more comments in the description about what this PR is enabling?

felipemello1 · 2025-10-08T22:37:35Z

> could you add some more comments in the description about what this PR is enabling?

Yes, sorry! I should have marked this as a draft. I am doing a 2.5/4.0 before I ask you to review it.

apps/grpo/qwen3_1_7b.yaml

felipemello1 · 2025-10-09T03:22:41Z

src/forge/observability/metrics.py

            self.timestamp = datetime.now(pytz.UTC).timestamp()


-def get_actor_name_with_rank() -> str:


moved to observability/utils.py


felipemello1 · 2025-10-09T20:16:10Z

src/forge/observability/metric_actors.py

+    if process_name is None:
+        process_name = detect_actor_name_from_call_stack()


Get the name here and pass it to:

local_fetcher_actor = proc.spawn("local_fetcher_actor", LocalFetcherActor, global_logger, process_name)

This function is called in provisioner.py, and that's how we get the process_name for every wandb run.
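A small sketch of the fallback rule in the diff above, for readers skimming the thread. `resolve_process_name` and its `detect_default` parameter are hypothetical names used only for illustration; the PR does this check inline.

```python
from typing import Callable, Optional


def resolve_process_name(
    process_name: Optional[str],
    detect_default: Callable[[], str],
) -> str:
    """Return the explicit name if given, otherwise ask the detector for one."""
    if process_name is None:
        # Only None triggers detection; an empty string is passed through as-is.
        process_name = detect_default()
    return process_name


# An explicit name wins; the detector is only consulted when nothing was passed.
print(resolve_process_name("Controller", lambda: "TrainActor"))
print(resolve_process_name(None, lambda: "TrainActor"))
```

Passing the detector in as a callable keeps the rule testable without a running actor system.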

felipemello1 · 2025-10-09T20:16:26Z

src/forge/observability/utils.py

+logger = logging.getLogger(__name__)
+
+
+def detect_actor_name_from_call_stack() -> str:


main file to review

felipemello1 · 2025-10-10T22:54:36Z

In the near future the mesh might hold a name. When that happens, we can delete the call-stack function and just get the name from the mesh. The rest of the PR stands.
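For context on what a call-stack lookup involves, here is a rough sketch of the general technique. It is not the code from this PR: `Actor` is a hypothetical base class, and `guess_actor_name_from_stack` is an illustrative name. It walks up the caller frames and returns the class name of the first `self` that is an actor.

```python
import inspect


class Actor:
    """Hypothetical base class standing in for the framework's actor type."""


def guess_actor_name_from_stack(default: str = "UnknownActor") -> str:
    """Walk up the call stack and return the class name of the nearest actor."""
    frame = inspect.currentframe()
    try:
        caller = frame.f_back if frame is not None else None
        while caller is not None:
            # Methods conventionally bind their instance to a local named "self".
            candidate = caller.f_locals.get("self")
            if isinstance(candidate, Actor):
                return type(candidate).__name__
            caller = caller.f_back
        return default
    finally:
        # Drop the frame reference so it does not keep a reference cycle alive.
        del frame


class TrainActor(Actor):
    def name(self) -> str:
        return guess_actor_name_from_stack()


print(TrainActor().name())
```

This kind of lookup depends on naming conventions and on who happens to be on the stack, which is why reading the name from the mesh is the more robust option once it is available.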


ebsmothers

a few small comments but stamping to unblock

ebsmothers · 2025-10-14T21:38:51Z

tests/sandbox/toy_rl/toy_metrics/main.py

    # Spawn services first (triggers registrations via provisioner hook)
-    trainer = await TrainActor.options(**service_config).as_service()
-    generator = await GeneratorActor.options(**service_config).as_service()
+    trainer = await TrainActor.options(


Dumb q but what are we doing with this toy_rl stuff anyways? (Like is there some reason we're still keeping it around?)

I still find it useful, e.g. I don't need to have GPUs available to test it. Eventually this should become an integration test.

ebsmothers · 2025-10-14T21:43:21Z

src/forge/observability/metrics.py


    Timestamp is automatically set to current EST time if not provided.
+
+    Args:


nit: give an actual docstring here (otherwise i can just read this info 5 lines below)

ebsmothers · 2025-10-14T21:45:58Z

src/forge/observability/metric_actors.py

+    if process_name is None:
+        ctx = context()
+        process_name = ctx.actor_instance.actor_id.actor_name


A small thing, but it's not immediately clear why we do this here vs. get_proc_name_with_rank in other places. (After looking at the code I think it's a global vs. local thing, but imo this could be more clearly documented.)

get_proc_name_with_rank returns the name with the replica id and rank. Here I just want the process name, so I would have to parse it :/
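To illustrate why parsing the combined string is unattractive, here is a sketch under an assumed format. The `ProcIdentity` class and the `name_replica_rN` layout are hypothetical; the thread does not show the real output of get_proc_name_with_rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcIdentity:
    """Hypothetical container that keeps the name parts separate."""

    process_name: str
    replica_id: str
    rank: int

    def with_rank(self) -> str:
        # Assumed format for illustration only: "<name>_<replica>_r<rank>".
        return f"{self.process_name}_{self.replica_id}_r{self.rank}"


ident = ProcIdentity(process_name="my_trainer", replica_id="abcd", rank=0)
combined = ident.with_rank()

# Naive parsing breaks as soon as the process name itself contains the separator.
naive_name = combined.split("_")[0]

print(combined)
print(naive_name)
print(ident.process_name)
```

Reading the plain name from its source, as the diff does via the actor context, avoids depending on the combined string's layout.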

…estamp_logging_diff3

codecov-commenter · 2025-10-15T15:58:10Z

Codecov Report

❌ Patch coverage is 92.20779% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.14%. Comparing base (4c14792) to head (e901ad5).
⚠️ Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
src/forge/observability/metric_actors.py	57.14%	3 Missing ⚠️
src/forge/observability/metrics.py	62.50%	3 Missing ⚠️
tests/unit_tests/observability/conftest.py	50.00%	2 Missing ⚠️
tests/unit_tests/observability/test_utils.py	93.54%	2 Missing ⚠️
src/forge/controller/provisioner.py	0.00%	1 Missing ⚠️
src/forge/observability/utils.py	94.73%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #351      +/-   ##
==========================================
- Coverage   73.68%   65.14%   -8.54%     
==========================================
  Files          81       82       +1     
  Lines        7729     7850     +121     
==========================================
- Hits         5695     5114     -581     
- Misses       2034     2736     +702
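The numbers in the table can be cross-checked by hand. The sketch below recomputes them from the hit and line counts, assuming the percentages are truncated to two decimals rather than rounded (an assumption; it is what makes 5114/7850 come out as 65.14 instead of 65.15).

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    # Integer math avoids float noise: basis points first, then divide by 100.
    return (hits * 10_000 // lines) / 100


base = coverage_percent(hits=5695, lines=7729)  # main
head = coverage_percent(hits=5114, lines=7850)  # this PR
delta = round(head - base, 2)

print(base, head, delta)

# Hits plus misses should account for every line on both sides.
assert 5695 + 2034 == 7729
assert 5114 + 2736 == 7850
```

Note that lines grew by 121 while hits fell by 581, so the drop in project coverage is larger than the patch's own 12 missing lines would explain.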


Felipe Mello added 11 commits October 8, 2025 08:38

commit

77488cf

commit

feb4771

update backend role typehints and enum

41ceaa4

update where we check FORGE_DISABLE_METRICS

8a24e71

remove protected import

3f3bc51

Merge branch 'timestamp_logging_diff1' into timestamp_logging_diff2

d82c354

protect import

4fe2611

Merge branch 'timestamp_logging_diff1' into timestamp_logging_diff2

8759bc8

Merge branch 'main' of https://github.com/meta-pytorch/forge into timestamp_logging_diff2

fbb4a9e

record_metric uses dataclass Metric

d81a4ed

commit

1e2255d

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 8, 2025

Merge branch 'main' of https://github.com/meta-pytorch/forge into timestamp_logging_diff3

a94c612

felipemello1 marked this pull request as draft October 8, 2025 22:37

felipemello1 changed the title ~~Metric Logging updates 3/4~~ [wip] Metric Logging updates 3/4 Oct 8, 2025

Felipe Mello added 4 commits October 8, 2025 19:03

commit

5b477e8

commit

f2b3eed

revert

471b88a

Merge branch 'timestamp_logging_diff2_5' into timestamp_logging_diff3

1a02784

felipemello1 changed the title ~~[wip] Metric Logging updates 3/4~~ Metric Logging updates 4/N Oct 9, 2025

felipemello1 commented Oct 9, 2025

apps/grpo/qwen3_1_7b.yaml

remove unnecessary code

fa4895f

felipemello1 commented Oct 9, 2025

better logging

7bb1fe7

felipemello1 marked this pull request as ready for review October 9, 2025 03:27

Felipe Mello added 3 commits October 9, 2025 07:23

docs/names

43d5d27

Merge branch 'timestamp_logging_diff2_5' into timestamp_logging_diff3

c97eb98

Merge branch 'main' of https://github.com/meta-pytorch/forge into timestamp_logging_diff3

70e9c67

felipemello1 requested a review from allenwang28 October 9, 2025 19:52

felipemello1 commented Oct 9, 2025

update cfg back to true

1186aec

felipemello1 assigned joecummings and ebsmothers Oct 10, 2025

Merge branch 'main' of https://github.com/meta-pytorch/forge into timestamp_logging_diff3

a02ea75

felipemello1 changed the title ~~Metric Logging updates 4/N~~ Metric Logging updates 4/N - better actor name Oct 14, 2025

Felipe Mello added 3 commits October 13, 2025 18:22

Merge branch 'main' of https://github.com/meta-pytorch/forge into timestamp_logging_diff3

7d89f5c

remove callstack, get meshname in provisioner

370c4e4

get name from proc mesh

9e77930

felipemello1 marked this pull request as draft October 14, 2025 15:31

Felipe Mello added 2 commits October 14, 2025 12:55

simplify + unit tests

93b0cad

Merge branch 'main' of https://github.com/meta-pytorch/forge into timestamp_logging_diff3

84363b1

felipemello1 marked this pull request as ready for review October 14, 2025 20:06

ebsmothers approved these changes Oct 14, 2025

Felipe Mello added 2 commits October 15, 2025 08:15

Merge branch 'main' of https://github.com/meta-pytorch/forge into timestamp_logging_diff3

77e426b

address comments

e901ad5

felipemello1 merged commit 1f45470 into meta-pytorch:main Oct 15, 2025
9 checks passed

allenwang28 pushed a commit to allenwang28/forge that referenced this pull request Oct 15, 2025

Metric Logging updates 4/N - better actor name (meta-pytorch#351)

0496edb

Co-authored-by: Felipe Mello <[email protected]>

allenwang28 added a commit to allenwang28/forge that referenced this pull request Oct 15, 2025

Revert "Metric Logging updates 4/N - better actor name (meta-pytorch#351)"

fffcb88

This reverts commit 1f45470.

allenwang28 added a commit that referenced this pull request Oct 15, 2025

Revert "Metric Logging updates 4/N - better actor name (#351)" (#429)

633b219

felipemello1 mentioned this pull request Oct 17, 2025

fix - Metric logging work with new monarch API #451

Merged

felipemello1 pushed a commit to felipemello1/forge that referenced this pull request Oct 17, 2025

Reapply "Metric Logging updates 4/N - better actor name (meta-pytorch#351)" (meta-pytorch#429)

92326bc

This reverts commit 633b219.

felipemello1 pushed a commit to felipemello1/forge that referenced this pull request Oct 17, 2025

Reapply "Metric Logging updates 4/N - better actor name (meta-pytorch#351)" (meta-pytorch#429)

44c9883

This reverts commit 633b219.

HosseinKaviani-H pushed a commit to HosseinKaviani-H/forge that referenced this pull request Oct 21, 2025

Metric Logging updates 4/N - better actor name (meta-pytorch#351)

1e99351

Co-authored-by: Felipe Mello <[email protected]>

photomz pushed a commit to photomz/forge that referenced this pull request Oct 25, 2025

Metric Logging updates 4/N - better actor name (meta-pytorch#351)

2be8960

Co-authored-by: Felipe Mello <[email protected]>

photomz pushed a commit to photomz/forge that referenced this pull request Oct 25, 2025

Revert "Metric Logging updates 4/N - better actor name (meta-pytorch#351)" (meta-pytorch#429)

3cf0107
