feat: reinit="allow" setting for multiple simultaneous runs in one process #9562

timoffex · 2025-03-05T23:23:28Z

Adds an `"allow"` option to the `reinit` setting, which causes `wandb.init()` to create a new run even if other runs exist. This should eventually become the default, and the other options (`"return_previous"` and `"finish_previous"`) should be deprecated.

Although it's possible to pass reinit as a wandb.init() argument, the best way to use this is with wandb.setup().

Here is an example that uses the allow setting to (serially) run multiple sub-experiments while logging their results to another run:

import wandb

wandb.setup(wandb.Settings(reinit="allow"))

with wandb.init() as experiment_results_run:
    for ...:
        with wandb.init() as run:
            # The do_experiment() function logs fine-grained metrics
            # to the given run and then returns result metrics that
            # we'd like to track separately.
            experiment_results = do_experiment(run)

            # After each experiment, we log its results to a parent
            # run. Each point in the parent run's charts corresponds
            # to one experiment's results.
            experiment_results_run.log(experiment_results)
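
To make the difference between the three `reinit` values concrete, here is a toy model that does not import wandb. `ToyRun` and `ToySession` are hypothetical names made up for this sketch. The behavior of `"return_previous"` and `"finish_previous"` is inferred from the option names, so treat it as an assumption rather than the SDK's actual logic; only the `"allow"` branch is described in this PR.

```python
from dataclasses import dataclass, field
from typing import List

VALID_MODES = ("return_previous", "finish_previous", "allow")


@dataclass(eq=False)
class ToyRun:
    """Stand-in for a run object; this is NOT the wandb Run class."""

    name: str
    finished: bool = False

    def finish(self) -> None:
        self.finished = True


@dataclass
class ToySession:
    """Hypothetical model of how init() might react to each reinit value."""

    reinit: str = "return_previous"
    active: List[ToyRun] = field(default_factory=list)

    def init(self, name: str) -> ToyRun:
        if self.reinit not in VALID_MODES:
            raise ValueError(f"unknown reinit mode: {self.reinit!r}")

        # Forget runs that were finished since the previous call.
        self.active = [run for run in self.active if not run.finished]

        if self.active:
            if self.reinit == "return_previous":
                # Assumed behavior: hand back the most recent unfinished run.
                return self.active[-1]
            if self.reinit == "finish_previous":
                # Assumed behavior: finish every active run first.
                for run in self.active:
                    run.finish()
                self.active = []
            # "allow": leave the other runs untouched and fall through.

        new_run = ToyRun(name)
        self.active.append(new_run)
        return new_run


session = ToySession(reinit="allow")
parent = session.init("parent")
child = session.init("child")
print(parent is child, parent.finished)
```

With `"allow"`, the parent stays active while the child is created, which is what the nested `with wandb.init()` example above relies on.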

timoffex · 2025-03-05T23:23:43Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

codecov · 2025-03-05T23:28:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.44%. Comparing base (526f4dc) to head (0da4fca).
Report is 5 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9562      +/-   ##
==========================================
- Coverage   80.48%   80.44%   -0.04%     
==========================================
  Files         737      738       +1     
  Lines       78967    79029      +62     
==========================================
+ Hits        63554    63573      +19     
- Misses      14683    14725      +42     
- Partials      730      731       +1

| Flag | Coverage Δ | |
|---|---|---|
| func | `46.20% <59.45%> (+0.19%)` | ⬆️ |
| system | `64.80% <98.75%> (-0.01%)` | ⬇️ |
| unit | `65.26% <27.02%> (-0.10%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| tests/conftest.py | `98.76% <100.00%> (-0.41%)` | ⬇️ |
| tests/system_tests/test_core/test_wandb_init.py | `100.00% <ø> (ø)` | |
| ...s/system_tests/test_core/test_wandb_init_reinit.py | `100.00% <100.00%> (ø)` | |
| tests/system_tests/test_core/test_wandb_setup.py | `100.00% <100.00%> (ø)` | |
| wandb/sdk/wandb_init.py | `84.52% <100.00%> (+0.04%)` | ⬆️ |
| wandb/sdk/wandb_run.py | `88.73% <100.00%> (+0.05%)` | ⬆️ |
| wandb/sdk/wandb_settings.py | `91.66% <ø> (ø)` | |
| wandb/sdk/wandb_setup.py | `88.78% <100.00%> (+1.07%)` | ⬆️ |

... and 26 files with indirect coverage changes


kptkin · 2025-03-06T21:28:10Z

tests/system_tests/test_core/test_wandb_init_reinit.py

is it interesting to test combinations of different reinit modes in the same script?

This file includes tests for interactions that are special; is there anything missing?

let me read it more carefully again!

wandb/sdk/wandb_setup.py

kptkin · 2025-03-17T19:42:21Z

wandb/sdk/wandb_settings.py

+        Note that if another run already exists, `wandb.run` will continue to
+        refer to it and will not be updated to the new run. When that run
+        finishes, `wandb.run` will become `None`. This may affect some
+        integrations.


i think this behavior of the global run is going to cause some confusion for users, as it will require them to carefully track when they create and finish each run. maybe a better solution is to disallow the global run when we have multiple active runs?

or have multiple global runs that users look up by id or something? not sure if this approach is better, but it's an alternative to fully disabling the global run

This behavior is for integrations; end users should avoid using wandb.run.

An alternative I considered is to make accesses to wandb.run raise an error after reinit="allow" is used. But we check wandb.run often in our own code (for example, to get its settings), so without fixing all those instances, it would break the SDK.

So wandb.run needs to either be None or a run. If starting a reinit="allow" run were to change it to None or the new run, then integrations relying on wandb.run that were initialized before the run would break or misbehave.

The rule in this PR is that wandb.run can only change when you call wandb.init() and refers to the same run until that run finishes.
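
The rule can be illustrated with a small toy model that does not import wandb. `ToyRegistry`, `ToyRun`, and `global_run` are hypothetical names for this sketch, with `global_run` standing in for `wandb.run`. The sketch assumes that `init()` sets the global run only when no global run currently exists, which is my reading of the docstring quoted above, not a copy of the SDK's code.

```python
from typing import Optional


class ToyRun:
    """Stand-in for a run; NOT the wandb Run class."""

    def __init__(self, registry: "ToyRegistry", name: str) -> None:
        self._registry = registry
        self.name = name
        self.finished = False

    def finish(self) -> None:
        self.finished = True
        self._registry.on_finish(self)


class ToyRegistry:
    """Toy model of the global-run rule under reinit="allow"."""

    def __init__(self) -> None:
        # Plays the role of wandb.run in this sketch.
        self.global_run: Optional[ToyRun] = None

    def init(self, name: str) -> ToyRun:
        run = ToyRun(self, name)
        # The global run changes only here, and only if it is unset.
        if self.global_run is None:
            self.global_run = run
        return run

    def on_finish(self, run: ToyRun) -> None:
        # Finishing the global run clears it; it is NOT handed over
        # to another run that happens to still be active.
        if self.global_run is run:
            self.global_run = None


registry = ToyRegistry()
first = registry.init("first")
second = registry.init("second")   # global run still refers to "first"
first.finish()                     # global run becomes None
print(registry.global_run)
```

Note that `second` is still active when the global run becomes `None`; this is the situation the docstring warns may affect some integrations.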

hmmm i see... i guess i was thinking of integrations in third-party repos, where i don't think they change these settings, but i guess there is a case for integrations in our own code base.

i guess it's fine, we probably should document it and discourage certain usage patterns especially with integrations?

We don't mention using wandb.run in our guide for integrations, and I recently updated those docs and some other top-level docs to not use the global run.

Factors out a `monitor.GPUResourceManager` struct that manages the `gpu_stats` process and whose lifetime is tied to the server rather than a run. When using `reinit="allow"` (added in #9562) only a single `gpu_stats` process is started up when multiple runs are active. The process is killed when no runs are active.
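
The lifetime described here (start on the first active run, stop when the last one goes away) is essentially reference counting. Below is a minimal Python sketch of that idea. It is not the actual `monitor.GPUResourceManager`, whose code is not shown here; `SharedProcessManager`, `acquire`, `release`, and the `start`/`stop` callbacks are all hypothetical names chosen for illustration.

```python
import threading
from typing import Any, Callable, Optional


class SharedProcessManager:
    """Hypothetical ref-counted owner of a single shared helper process."""

    def __init__(
        self,
        start: Callable[[], Any],
        stop: Callable[[Any], None],
    ) -> None:
        self._start = start
        self._stop = stop
        self._lock = threading.Lock()
        self._users = 0
        self._handle: Optional[Any] = None

    def acquire(self) -> Any:
        """Called when a run becomes active; starts the process if needed."""
        with self._lock:
            if self._users == 0:
                self._handle = self._start()
            self._users += 1
            return self._handle

    def release(self) -> None:
        """Called when a run finishes; stops the process after the last one."""
        with self._lock:
            if self._users == 0:
                raise RuntimeError("release() without a matching acquire()")
            self._users -= 1
            if self._users == 0:
                self._stop(self._handle)
                self._handle = None


events = []
manager = SharedProcessManager(
    start=lambda: events.append("start") or "handle",
    stop=lambda handle: events.append("stop"),
)
manager.acquire()   # first run: process starts
manager.acquire()   # second run: reuses the same process
manager.release()
manager.release()   # last run gone: process stops
print(events)
```

Tying the manager's lifetime to the server rather than to a run means each run only needs to call the acquire and release steps, and no run has to know whether others are active.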

timoffex mentioned this pull request Mar 5, 2025

refactor: extend reinit setting to allow string values #9557

Merged

timoffex changed the title ~~reinit allow~~ feat: reinit="allow" setting for multiple simultaneous runs in one process Mar 5, 2025

timoffex force-pushed the 03-05-reinit_allow branch from d7c7f89 to d6495b6 Compare March 5, 2025 23:44

timoffex force-pushed the 03-04-reinit_string branch from 57349c1 to 219caae Compare March 5, 2025 23:44

timoffex mentioned this pull request Mar 6, 2025

Document new reinit='allow' setting wandb/docs#1145

Open

timoffex force-pushed the 03-05-reinit_allow branch from d6495b6 to a214c23 Compare March 6, 2025 00:33

timoffex marked this pull request as ready for review March 6, 2025 00:34

timoffex requested a review from a team as a code owner March 6, 2025 00:34

timoffex force-pushed the 03-05-reinit_allow branch from a214c23 to dfef31c Compare March 6, 2025 00:39

timoffex changed the base branch from 03-04-reinit_string to graphite-base/9562 March 6, 2025 01:25

timoffex force-pushed the 03-05-reinit_allow branch from dfef31c to 97e1220 Compare March 6, 2025 01:25

timoffex force-pushed the graphite-base/9562 branch from 219caae to 1459bda Compare March 6, 2025 01:25

timoffex force-pushed the 03-05-reinit_allow branch from 97e1220 to 34d6234 Compare March 6, 2025 01:25

timoffex changed the base branch from graphite-base/9562 to main March 6, 2025 01:25

timoffex force-pushed the 03-05-reinit_allow branch from 34d6234 to ef35260 Compare March 6, 2025 01:25

kptkin reviewed Mar 6, 2025

wandb/sdk/wandb_setup.py (resolved)

timoffex requested a review from kptkin March 10, 2025 19:09

kptkin reviewed Mar 17, 2025

wandb/sdk/wandb_setup.py (resolved)

kptkin reviewed Mar 17, 2025


timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 00:03

timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 00:04

timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:56

timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 16:57

timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:58

timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 16:58

timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:59

timoffex force-pushed the 03-05-reinit_allow branch from ef35260 to 24e7034 Compare March 25, 2025 16:59

timoffex changed the base branch from graphite-base/9562 to 03-24-multi_run_gpu_metrics March 25, 2025 16:59

timoffex mentioned this pull request Mar 25, 2025

refactor: share gpu_stats process between multiple runs #9632

Merged

timoffex force-pushed the 03-05-reinit_allow branch from 24e7034 to d88c9af Compare March 25, 2025 17:11

timoffex force-pushed the 03-24-multi_run_gpu_metrics branch from d5de854 to 100ae87 Compare March 25, 2025 17:11

timoffex force-pushed the 03-05-reinit_allow branch from d88c9af to 785422b Compare March 25, 2025 17:40

timoffex requested a review from kptkin March 25, 2025 20:41

timoffex changed the base branch from 03-24-multi_run_gpu_metrics to graphite-base/9562 March 25, 2025 20:41

timoffex force-pushed the graphite-base/9562 branch from 100ae87 to 526f4dc Compare March 25, 2025 20:56

timoffex force-pushed the 03-05-reinit_allow branch from 785422b to ca7aeea Compare March 25, 2025 20:56

graphite-app bot changed the base branch from graphite-base/9562 to main March 25, 2025 20:57

reinit allow

0da4fca

timoffex force-pushed the 03-05-reinit_allow branch from ca7aeea to 0da4fca Compare March 25, 2025 20:57
