Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: reinit="allow" setting for multiple simultaneous runs in one process #9562

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

timoffex
Copy link
Contributor

@timoffex timoffex commented Mar 5, 2025

Adds an allow option to the reinit setting which causes wandb.init() to create a new run even if other runs exist. This should eventually become the default, and the other options ("return_previous" and "finish_previous") should be deprecated.

Although it's possible to pass reinit as a wandb.init() argument, the best way to use this is with wandb.setup().

Here is an example that uses the allow setting to (serially) run multiple sub-experiments while logging their results to another run:

wandb.setup(wandb.Settings(reinit="allow"))

with wandb.init() as experiment_results_run:
    for ...:
        with wandb.init() as run:
            # The do_experiment() function logs fine-grained metrics
            # to the given run and then returns result metrics that
            # we'd like to track separately.
            experiment_results = do_experiment(run)

            # After each experiment, we log its results to a parent
            # run. Each point in the parent run's charts corresponds
            # to one experiment's results.
            experiment_results_run.log(experiment_results)

Copy link
Contributor Author

timoffex commented Mar 5, 2025

@timoffex timoffex changed the title reinit allow feat: reinit="allow" setting for multiple simultaneous runs in one process Mar 5, 2025
Copy link

codecov bot commented Mar 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.44%. Comparing base (526f4dc) to head (0da4fca).
Report is 5 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #9562      +/-   ##
==========================================
- Coverage   80.48%   80.44%   -0.04%     
==========================================
  Files         737      738       +1     
  Lines       78967    79029      +62     
==========================================
+ Hits        63554    63573      +19     
- Misses      14683    14725      +42     
- Partials      730      731       +1     
Flag Coverage Δ
func 46.20% <59.45%> (+0.19%) ⬆️
system 64.80% <98.75%> (-0.01%) ⬇️
unit 65.26% <27.02%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
tests/conftest.py 98.76% <100.00%> (-0.41%) ⬇️
tests/system_tests/test_core/test_wandb_init.py 100.00% <ø> (ø)
...s/system_tests/test_core/test_wandb_init_reinit.py 100.00% <100.00%> (ø)
tests/system_tests/test_core/test_wandb_setup.py 100.00% <100.00%> (ø)
wandb/sdk/wandb_init.py 84.52% <100.00%> (+0.04%) ⬆️
wandb/sdk/wandb_run.py 88.73% <100.00%> (+0.05%) ⬆️
wandb/sdk/wandb_settings.py 91.66% <ø> (ø)
wandb/sdk/wandb_setup.py 88.78% <100.00%> (+1.07%) ⬆️

... and 26 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@timoffex timoffex force-pushed the 03-05-reinit_allow branch from d7c7f89 to d6495b6 Compare March 5, 2025 23:44
@timoffex timoffex force-pushed the 03-04-reinit_string branch from 57349c1 to 219caae Compare March 5, 2025 23:44
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from d6495b6 to a214c23 Compare March 6, 2025 00:33
@timoffex timoffex marked this pull request as ready for review March 6, 2025 00:34
@timoffex timoffex requested a review from a team as a code owner March 6, 2025 00:34
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from a214c23 to dfef31c Compare March 6, 2025 00:39
@timoffex timoffex changed the base branch from 03-04-reinit_string to graphite-base/9562 March 6, 2025 01:25
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from dfef31c to 97e1220 Compare March 6, 2025 01:25
@timoffex timoffex force-pushed the graphite-base/9562 branch from 219caae to 1459bda Compare March 6, 2025 01:25
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 97e1220 to 34d6234 Compare March 6, 2025 01:25
@timoffex timoffex changed the base branch from graphite-base/9562 to main March 6, 2025 01:25
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 34d6234 to ef35260 Compare March 6, 2025 01:25
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it interesting to test combinations of different reinit modes in the same script?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file includes tests for interactions that are special; is there anything missing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me read it more carefully again!

@timoffex timoffex requested a review from kptkin March 10, 2025 19:09
Comment on lines +281 to +427
Note that if another run already exists, `wandb.run` will continue to
refer to it and will not be updated to the new run. When that run
finishes, `wandb.run` will become `None`. This may affect some
integrations.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this behavior of global run is going to cause some confusion with users, as it will require to carefully track when they create and finish each run, maybe a better solution is to disallow global run when we have multiple active runs?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or have multiple global runs? and they need to use id or something, not sure if this approach is better, but an alternative to fully disabling global run

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior is for integrations; end users should avoid using wandb.run.

An alternative I considered is to make accesses to wandb.run raise an error after reinit="allow" is used. But we check wandb.run often in our own code (for example, to get its settings), so without fixing all those instances, it would break the SDK.

So wandb.run needs to either be None or a run. If starting a reinit="allow" run were to change it to None or the new run, then integrations relying on wandb.run that were initialized before the run would break or misbehave.

The rule in this PR is that wandb.run can only change when you call wandb.init() and refers to the same run until that run finishes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm i see... i guess i was thinking of integrations in third party repo, where i don't think they change these settings, but i guess there is a case for integrations in our code base.

i guess it's fine, we probably should document it and discourage certain usage patterns especially with integrations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't mention using wandb.run in our guide for integrations, and I recently updated those docs and some other top-level docs to not use the global run.

@timoffex timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 00:03
@timoffex timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 00:04
@timoffex timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:56
@timoffex timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 16:57
@timoffex timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:58
@timoffex timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 16:58
@timoffex timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:59
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from ef35260 to 24e7034 Compare March 25, 2025 16:59
@timoffex timoffex changed the base branch from graphite-base/9562 to 03-24-multi_run_gpu_metrics March 25, 2025 16:59
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 24e7034 to d88c9af Compare March 25, 2025 17:11
@timoffex timoffex force-pushed the 03-24-multi_run_gpu_metrics branch from d5de854 to 100ae87 Compare March 25, 2025 17:11
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from d88c9af to 785422b Compare March 25, 2025 17:40
@timoffex timoffex requested a review from kptkin March 25, 2025 20:41
@timoffex timoffex changed the base branch from 03-24-multi_run_gpu_metrics to graphite-base/9562 March 25, 2025 20:41
timoffex added a commit that referenced this pull request Mar 25, 2025
Factors out a `monitor.GPUResourceManager` struct that manages the `gpu_stats` process and whose lifetime is tied to the server rather than a run.

When using `reinit="allow"` (added in #9562) only a single `gpu_stats` process is started up when multiple runs are active. The process is killed when no runs are active.
@timoffex timoffex force-pushed the graphite-base/9562 branch from 100ae87 to 526f4dc Compare March 25, 2025 20:56
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 785422b to ca7aeea Compare March 25, 2025 20:56
@graphite-app graphite-app bot changed the base branch from graphite-base/9562 to main March 25, 2025 20:57
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from ca7aeea to 0da4fca Compare March 25, 2025 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants