[WIP] Working prototype of experiment sequence #2461
This pull request introduces 5 alerts when merging 421293e into d88aca2 - view on LGTM.com new alerts:
|
Also, I saw FR #2187, and I think that with some pickling tricks it's possible to adapt the loky launcher to what is described there (or to substitute one for the other, since they do the same thing).
Thanks @Zhylkaaa. I'll give this a review shortly. |
Hi @Zhylkaaa! Thank you for your pull request and welcome to our community.
Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
@Jasha10 I have added my take on a multiprocessing launcher for Hydra. (I can open a separate PR with just that launcher, without the experiment sequence part.) |
This pull request introduces 12 alerts when merging 587c509 into 035ffb5 - view on LGTM.com new alerts:
Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine ⚙️ that powers LGTM.com. For more information, please check out our post on the GitHub blog. |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
Thanks @Zhylkaaa. I'm going to push a few minor changes and will follow up with some comments / questions.
@@ -65,6 +66,7 @@ def launch( | |||
idx = initial_job_idx + idx | |||
lst = " ".join(filter_overrides(overrides)) | |||
log.info(f"\t#{idx} : {lst}") | |||
print(overrides) |
print(overrides) |
CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.
This pull request introduces 6 alerts when merging 162249f into afde761 - view on LGTM.com new alerts:
Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. It looks like GitHub code scanning with CodeQL is already set up for this repo, so no further action is needed 🚀. For more information, please check out our post on the GitHub blog. |
This looks good. I like your idea of delegating scheduling to the launcher.
My main concern is backwards compatibility. Facebook/Meta has a pretty strong internal requirement for backwards compat, so I don't think we can merge this unless the below issues are addressed:
# Number of parallel workers | ||
n_jobs: int = 2 | ||
|
This is a breaking change.
Instead of changing the API for OptunaSweeperConf, what if we call ConfigStore.store twice? We can do something like store(node=OptunaSweeperConfV2, name="optuna_v2") for the new API and store(node=OptunaSweeperConf, name="optuna") for backward compatibility.
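The two-registration idea can be sketched as follows. This is only an illustration: to stay runnable without Hydra installed, a small dict-based `Registry` class stands in for `ConfigStore`, and the fields of both dataclasses are placeholders (the sketch assumes, for the sake of the example, that the new schema drops `n_jobs`).

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class OptunaSweeperConf:
    """Legacy schema. Field set is an illustrative placeholder."""

    n_trials: int = 20
    n_jobs: int = 2  # number of parallel workers


@dataclass
class OptunaSweeperConfV2:
    """New schema. Assumption: parallelism is delegated to the launcher."""

    n_trials: int = 20


class Registry:
    """Dict-based stand-in for a config store (NOT Hydra's ConfigStore)."""

    def __init__(self) -> None:
        self._nodes: Dict[str, type] = {}

    def store(self, node: type, name: str) -> None:
        # Refuse to silently overwrite an existing schema name.
        if name in self._nodes:
            raise ValueError(f"config name already registered: {name}")
        self._nodes[name] = node

    def field_names(self, name: str) -> List[str]:
        return [f.name for f in fields(self._nodes[name])]


registry = Registry()
registry.store(node=OptunaSweeperConf, name="optuna")  # old name, old schema
registry.store(node=OptunaSweeperConfV2, name="optuna_v2")  # opt-in new schema

legacy_fields = registry.field_names("optuna")
v2_fields = registry.field_names("optuna_v2")
```

Existing configs that select `optuna` keep seeing the old fields, while users opt in to the new schema by name.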
hydra/plugins/launcher.py
Outdated
self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int | ||
self, | ||
job_overrides: Union[Sequence[Sequence[str]], ExperimentSequence], | ||
initial_job_idx: int, |
This is a breaking change. Type checkers (e.g. mypy) will complain about downstream launchers (including custom launcher plugins that users have written).
Instead of changing the API of Launcher.launch, what if we define a new method Launcher.launch_experiment_sequence? We can provide a default implementation that raises NotImplementedError.
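A minimal sketch of that suggestion, assuming a simplified stand-in base class rather than Hydra's actual `Launcher` source. `LegacyLauncher` is a hypothetical downstream plugin, and `Iterable[Sequence[str]]` is used in place of the PR's `ExperimentSequence` type.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, List, Sequence


class Launcher(ABC):
    """Simplified stand-in for a launcher plugin base class."""

    @abstractmethod
    def launch(
        self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
    ) -> List[Any]:
        """Existing API: the signature is left untouched."""

    def launch_experiment_sequence(
        self, job_overrides: Iterable[Sequence[str]], initial_job_idx: int
    ) -> List[Any]:
        """New opt-in API: the default refuses instead of breaking subclasses."""
        raise NotImplementedError(
            f"{type(self).__name__} does not support experiment sequences"
        )


class LegacyLauncher(Launcher):
    """Hypothetical third-party launcher written before the new method existed."""

    def launch(
        self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
    ) -> List[Any]:
        return [
            f"job {initial_job_idx + i}: {' '.join(overrides)}"
            for i, overrides in enumerate(job_overrides)
        ]


legacy = LegacyLauncher()
results = legacy.launch([["x=1"], ["x=2", "y=3"]], initial_job_idx=0)

# A sweeper can probe for support and fall back to plain batches.
try:
    legacy.launch_experiment_sequence(iter([["x=1"]]), initial_job_idx=0)
    supports_sequences = True
except NotImplementedError:
    supports_sequences = False
```

Because `launch` keeps its original signature, the legacy subclass still instantiates and type-checks unchanged; only launchers that override the new method advertise support for sequences.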
162249f to f59ec7b
This pull request introduces 14 alerts when merging f59ec7b into c48ef19 - view on LGTM.com new alerts:
Hi @Jasha10, I've edited this PR according to what we were talking about and it seems to work. The only issue is the ax sweeper and the aws launcher. I can't figure out what the issue is and whether it's me who caused it. |
Thanks @Zhylkaaa. I'll take a look shortly. |
The only issue is the ax sweeper and the aws launcher. I can't figure out what the issue is and whether it's me who caused it.
No, this is not your fault. The ax sweeper is failing on the main branch too.
@@ -193,6 +193,47 @@ def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None: | |||
assert returns["best_value"] <= 2.27 | |||
|
|||
|
|||
@mark.parametrize("with_commandline", (True, False)) | |||
def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None: |
def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None: | |
def test_optuna_v2_example(with_commandline: bool, tmpdir: Path) -> None: |
This prevents name-collision with the other test_optuna_example function above.
"example/sphere_sequence.py", | ||
"--multirun", | ||
"hydra.sweep.dir=" + str(tmpdir), |
"example/sphere_sequence.py", | |
"--multirun", | |
"hydra.sweep.dir=" + str(tmpdir), | |
"example/sphere_sequence.py", | |
"--multirun", | |
"hydra/sweeper=optuna_v2", | |
"hydra.sweep.dir=" + str(tmpdir), |
Adding the override hydra/sweeper=optuna_v2 makes sure the new OptunaSweeperConfV2 gets used.
Also sorry @Jasha10 for not bringing this up earlier, but in |
Thanks @Zhylkaaa. I'll review this shortly. |
Sorry I've been slow on this.
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved |
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. | |
# | |
# Licensed under the Apache License, Version 2.0 (the "License"); | |
# you may not use this file except in compliance with the License. | |
# You may obtain a copy of the License at | |
# | |
# http://www.apache.org/licenses/LICENSE-2.0 | |
# | |
# Unless required by applicable law or agreed to in writing, software | |
# distributed under the License is distributed on an "AS IS" BASIS, | |
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
# See the License for the specific language governing permissions and | |
# limitations under the License. | |
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved | |
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved |
No need to add the license to sweeper.py since sweeper.py is not otherwise modified.
Let me check; maybe I made some changes and haven't committed them.
Yes, I originally had ExperimentSequence in the sweeper file and forgot to remove the license, sorry. I think this is otherwise good to go (we could also refactor the multiprocessing launcher, but that would take too much time, so I think it will be in the next PR).
@@ -2,7 +2,7 @@ | |||
import logging | |||
from dataclasses import dataclass | |||
from pathlib import Path | |||
from typing import List, Optional, Sequence | |||
from typing import List, Optional, Sequence, Union |
from typing import List, Optional, Sequence, Union | |
from typing import List, Optional, Sequence |
Can you explain the motivation? Can you explain why moving the batching logic to the launchers as opposed to implementing it in the sweepers boosts GPU utilization? |
Hey @omry, I would like to illustrate it with an example:
Everything is fine when hidden_size values are close to each other, but for wide sweeps (with many different values to search through) we may well end up with a batch of, let's say, 8 jobs, 7 of which finish in an hour (because of the same size or early stopping), while one job runs for another 3 hours. The current implementation waits those 3 hours before returning the batch of jobs to the Optuna study and drawing the next batch, which leaves 7 executors idle waiting for 1 (in fact, we saw >25% idle time for 24+ hour experiments). In the case of GPUs this is a big waste: the idle time costs a lot in terms of infrastructure (an hour of AWS 8xV100 costs ~$18) and overall experiment duration (consider man-hours).
One possible way to solve this is to overdraw experiments (instead of sampling 8 jobs, sample 16) and hope that this smooths things out a bit. This is a viable approach, but considering that 8 points are sometimes sufficient to discard a relatively big region of the search space, and that the next 8 evaluated points are then quite possibly outdated, I would call this approach sub-optimal.
So we propose this solution that introduces the |
The Optuna Sweeper is really expecting the launcher to be asynchronous. Something along these lines:
class AsyncLauncher(Plugin):
    def submit(self, job_overrides: Sequence[str]) -> int: ...  # returns a job id
    def await_(self, job_id: int) -> JobReturn: ...  # trailing underscore because `await` is a reserved word in Python
    def cancel(self, job_id: int): ...
    def awaitAll(self): ...
Synchronous launchers could be implemented in terms of asynchronous operations. |
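One way to read the last sentence is sketched below. It is a toy built on a standard-library thread pool, not a Hydra plugin: `ThreadAsyncLauncher`, `toy_task` and the float return values are all assumptions made for illustration, and the method names follow the outline above, with `await_` and `await_all` standing in for `await` and `awaitAll`.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


class ThreadAsyncLauncher:
    """Toy launcher exposing submit / await_ / cancel / await_all."""

    def __init__(
        self, task: Callable[[Sequence[str]], float], max_workers: int = 2
    ) -> None:
        self._task = task
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures: Dict[int, Future] = {}
        self._next_id = 0

    def submit(self, job_overrides: Sequence[str]) -> int:
        """Start a job and return its id without waiting for it."""
        job_id = self._next_id
        self._next_id += 1
        self._futures[job_id] = self._pool.submit(self._task, job_overrides)
        return job_id

    def await_(self, job_id: int) -> float:
        """Block until the given job finishes and return its result."""
        return self._futures[job_id].result()

    def cancel(self, job_id: int) -> bool:
        """Try to cancel a job; returns False if it is already running or done."""
        return self._futures[job_id].cancel()

    def await_all(self) -> List[float]:
        """Block until every submitted job finishes (assumes none was cancelled)."""
        return [self._futures[job_id].result() for job_id in sorted(self._futures)]

    def launch(self, job_overrides: Sequence[Sequence[str]]) -> List[float]:
        """Synchronous launch expressed through the asynchronous operations."""
        job_ids = [self.submit(overrides) for overrides in job_overrides]
        return [self.await_(job_id) for job_id in job_ids]

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def toy_task(overrides: Sequence[str]) -> float:
    # Parse a single "x=<number>" override and return x squared.
    x = float(overrides[0].split("=")[1])
    return x * x


launcher = ThreadAsyncLauncher(toy_task)
sync_results = launcher.launch([["x=1"], ["x=2"], ["x=3"]])
launcher.shutdown()
```

Here `launch` is just submit-then-wait, so a sweeper that only needs batches keeps working, while a sweeper that wants results as they arrive can call `submit` and `await_` directly.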
Thanks for the feedback @omry. EDIT: |
Hi @omry,
As @Zhylkaaa mentioned, a batch can be extremely uneven, leading to wasted resources. The proposed solution is meant to solve this feedback-loop problem for uneven batches. We don't expect a GPU ordinal to influence the final accuracy of a model, so we don't think that scheduling should take place in a sweeper. Also, we shouldn't expect sweeper developers to take a variety of different system architectures into consideration while developing these plugins. So this PR is meant to provide a further abstraction for the scheduling/feedback loop, which we believe should be a layer between the sweeper and the launcher. Launchers are asynchronous right now and we don't want to mess with them too much.
We discussed this approach with @Jasha10 and came to the conclusion that our solution is decent enough. We probably won't have any more time to start this from the ground up. |
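The cost of uneven batches can be estimated with a small simulation. The sketch below takes the batch shape described earlier in the thread (8 workers, 7 one-hour jobs plus one job that runs 3 hours longer) and repeats it twice. It assumes the job list is fixed in advance, which a real adaptive sweep is not, and the function names are made up for this example.

```python
import heapq
from typing import List, Tuple


def batched_schedule(durations: List[int], n_workers: int) -> Tuple[int, float]:
    """Run jobs in fixed batches; each batch waits for its slowest job."""
    makespan = 0
    for start in range(0, len(durations), n_workers):
        makespan += max(durations[start:start + n_workers])
    idle = n_workers * makespan - sum(durations)
    return makespan, idle / (n_workers * makespan)


def streaming_schedule(durations: List[int], n_workers: int) -> Tuple[int, float]:
    """Hand the next job to whichever worker becomes free first."""
    free_at = [0] * n_workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + duration)
    makespan = max(free_at)
    idle = n_workers * makespan - sum(durations)
    return makespan, idle / (n_workers * makespan)


# Two batches of the shape from the thread: seven 1-hour jobs and one 4-hour job.
durations = ([1] * 7 + [4]) * 2
batch_makespan, batch_idle = batched_schedule(durations, n_workers=8)
stream_makespan, stream_idle = streaming_schedule(durations, n_workers=8)
```

With these numbers the batched schedule takes 8 hours and the streaming one 6, because the seven freed workers start the next jobs after the first hour instead of waiting for the straggler.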
|
Hi @omry,
^ This will report results immediately as they come. And you can also launch new jobs with configurations sampled from the updated study (with new results taken into account) by additionally writing something like:
Probably another added benefit is that you can add a custom class for ExperimentSequence and tailor the Slurm job config to not over-allocate resources and to utilize cluster nodes better (this also influences how fast your tasks will be scheduled). At least I think this is possible on the job config side. (@Jasha10, please correct me if you can, because I have never worked with submitit, only with CLI sbatch.) We can add this feature to the launchers that we know how to update, but we need some kind of reassurance that this effort is worth something. |
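The snippets that originally accompanied this comment are not preserved, so the following is not a reconstruction of them. It is a hypothetical sketch of the feedback loop being described: each finished job is reported at once, and the freed worker is refilled from the updated sequence. `ToySequence`, its `update` method, `objective` and `run` are invented for illustration and are neither the PR's `ExperimentSequence` API nor Optuna's.

```python
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Dict, List, Optional, Tuple


class ToySequence:
    """Hypothetical iterator that proposes x values around the best one so far."""

    def __init__(self, n_trials: int) -> None:
        self.remaining = n_trials
        self.best: Optional[Tuple[float, float]] = None  # (x, loss)
        self.step = 4.0
        self.sign = 1.0
        self.history: List[Tuple[float, float]] = []

    def __iter__(self) -> "ToySequence":
        return self

    def __next__(self) -> float:
        if self.remaining == 0:
            raise StopIteration
        self.remaining -= 1
        self.sign = -self.sign  # alternate sides of the current centre
        centre = self.best[0] if self.best is not None else 0.0
        return centre + self.sign * self.step

    def update(self, x: float, loss: float) -> None:
        """Feed one finished result back so later proposals can use it."""
        self.history.append((x, loss))
        if self.best is None or loss < self.best[1]:
            self.best = (x, loss)
        self.step /= 2  # narrow the search a little after every result


def objective(x: float) -> float:
    return (x - 1.0) ** 2


def run(sequence: ToySequence, n_workers: int) -> None:
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pending: Dict[Future, float] = {}
        for x in sequence:  # fill every worker once
            pending[pool.submit(objective, x)] = x
            if len(pending) == n_workers:
                break
        while pending:
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                x = pending.pop(future)
                sequence.update(x, future.result())  # report immediately
                nxt = next(sequence, None)  # proposal uses the updated state
                if nxt is not None:
                    pending[pool.submit(objective, nxt)] = nxt


sequence = ToySequence(n_trials=12)
run(sequence, n_workers=3)
```

All proposals and updates happen on the main thread, so the toy sequence needs no locking; the order in which results arrive can vary between runs.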
# Conflicts: # plugins/hydra_joblib_launcher/hydra_plugins/hydra_joblib_launcher/_core.py
Hi @Jasha10 @omry, I was also considering a major refactor of the multiprocessing launcher, but I am not sure whether this makes sense. |
Hi @Zhylkaaa, I'm no longer working at Meta -- Sorry to say that I don't have the bandwidth to give this feature the attention that it deserves. |
Hi @Jasha10, I am sorry to hear that. Is there any chance we can get back to it in the foreseeable future? Thank you for your time. |
I seem to recall feeling that the abstraction was acceptable last time I looked at this PR. That being said, I do not completely understand the tradeoffs around @omry's You said earlier:
I will have to think about this... I am not clear at the moment about the advantages and disadvantages of the async API. |
Motivation
This PR moves the logic of batching and creating jobs into the launcher, so that resources can be utilized better: a worker that finishes early no longer has to wait for the rest of its batch, which boosts GPU utilization significantly.
Have you read the Contributing Guidelines on pull requests?
Yes
Test Plan
Not all launchers support the new feature, but if this change is worth adding we will work on adapting all launchers to it.
One test in Optuna still doesn't work; I will debug it in the near future.
Related Issues and PRs
PR is the result of #2435
@Jasha10 can you please take a look