-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: reinit="allow" setting for multiple simultaneous runs in one process #9562
base: main
Are you sure you want to change the base?
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #9562 +/- ##
==========================================
- Coverage 80.48% 80.44% -0.04%
==========================================
Files 737 738 +1
Lines 78967 79029 +62
==========================================
+ Hits 63554 63573 +19
- Misses 14683 14725 +42
- Partials 730 731 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
d7c7f89
to
d6495b6
Compare
57349c1
to
219caae
Compare
d6495b6
to
a214c23
Compare
a214c23
to
dfef31c
Compare
dfef31c
to
97e1220
Compare
219caae
to
1459bda
Compare
97e1220
to
34d6234
Compare
34d6234
to
ef35260
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it interesting to test combinations of different reinit modes in the same script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file includes tests for interactions that are special; is there anything missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me read it more carefully again!
Note that if another run already exists, `wandb.run` will continue to | ||
refer to it and will not be updated to the new run. When that run | ||
finishes, `wandb.run` will become `None`. This may affect some | ||
integrations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this behavior of global run is going to cause some confusion with users, as it will require to carefully track when they create and finish each run, maybe a better solution is to disallow global run when we have multiple active runs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or have multiple global runs? and they need to use id or something, not sure if this approach is better, but an alternative to fully disabling global run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is for integrations; end users should avoid using wandb.run
.
An alternative I considered is to make accesses to wandb.run
raise an error after reinit="allow"
is used. But we check wandb.run
often in our own code (for example, to get its settings), so without fixing all those instances, it would break the SDK.
So wandb.run
needs to either be None or a run. If starting a reinit="allow"
run were to change it to None or the new run, then integrations relying on wandb.run
that were initialized before the run would break or misbehave.
The rule in this PR is that wandb.run
can only change when you call wandb.init()
and refers to the same run until that run finishes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm i see... i guess i was thinking of integrations in third party repo, where i don't think they change these settings, but i guess there is a case for integrations in our code base.
i guess it's fine, we probably should document it and discourage certain usage patterns especially with integrations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't mention using wandb.run
in our guide for integrations, and I recently updated those docs and some other top-level docs to not use the global run.
ef35260
to
24e7034
Compare
24e7034
to
d88c9af
Compare
d5de854
to
100ae87
Compare
d88c9af
to
785422b
Compare
Factors out a `monitor.GPUResourceManager` struct that manages the `gpu_stats` process and whose lifetime is tied to the server rather than a run. When using `reinit="allow"` (added in #9562) only a single `gpu_stats` process is started up when multiple runs are active. The process is killed when no runs are active.
100ae87
to
526f4dc
Compare
785422b
to
ca7aeea
Compare
ca7aeea
to
0da4fca
Compare
Adds an
allow
option to thereinit
setting which causeswandb.init()
to create a new run even if other runs exist. This should eventually become the default, and the other options ("return_previous" and "finish_previous") should be deprecated.Although it's possible to pass
reinit
as awandb.init()
argument, the best way to use this is withwandb.setup()
.Here is an example that uses the
allow
setting to (serially) run multiple sub-experiments while logging their results to another run: