
[WIP] Working prototype of experiment sequence #2461

Draft · wants to merge 7 commits into main

Conversation

Zhylkaaa

Motivation

This PR moves the batching and job-creation logic into the launcher so that resources can be utilized better. This boosts GPU utilization significantly.

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

Not all launchers support the new feature yet, but if this change is worth adding, we will work on adapting all launchers to it.

One test in optuna still doesn't work; I will debug it in the near future.

Related Issues and PRs

PR is the result of #2435
@Jasha10 can you please take a look

@lgtm-com
Contributor

lgtm-com bot commented Nov 10, 2022

This pull request introduces 5 alerts when merging 421293e into d88aca2 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Unused local variable
  • 1 for Module is imported with 'import' and 'import from'
  • 1 for Nested loops with same variable

@Zhylkaaa
Author

Also, I saw FR #2187 and I think that with some tricks around pickling it's possible to adapt the loky launcher to what is described there (or to substitute one for the other, since they do the same thing).

@Jasha10
Collaborator

Jasha10 commented Nov 11, 2022

Thanks @Zhylkaaa. I'll give this a review shortly.

@facebook-github-bot
Contributor

Hi @Zhylkaaa!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@Zhylkaaa
Author

@Jasha10 I have added my take on implementing a multiprocessing launcher for hydra (I can open a separate PR with that launcher, removing the experiment-sequence part).

@lgtm-com
Contributor

lgtm-com bot commented Nov 18, 2022

This pull request introduces 12 alerts when merging 587c509 into 035ffb5 - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 4 for Nested loops with same variable
  • 2 for Module is imported with 'import' and 'import from'
  • 1 for Unused local variable

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine ⚙️ that powers LGTM.com. For more information, please check out our post on the GitHub blog.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 22, 2022
@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Collaborator

@Jasha10 Jasha10 left a comment


Thanks @Zhylkaaa. I'm going to push a few minor changes and will follow up with some comments / questions.

@@ -65,6 +66,7 @@ def launch(
idx = initial_job_idx + idx
lst = " ".join(filter_overrides(overrides))
log.info(f"\t#{idx} : {lst}")
print(overrides)
Collaborator

Suggested change (removes the debug print):
print(overrides)

@Jasha10 Jasha10 marked this pull request as draft December 5, 2022 21:43

@github-advanced-security github-advanced-security bot left a comment


CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@Jasha10 Jasha10 linked an issue Dec 5, 2022 that may be closed by this pull request
@lgtm-com
Contributor

lgtm-com bot commented Dec 5, 2022

This pull request introduces 6 alerts when merging 162249f into afde761 - view on LGTM.com

new alerts:

  • 4 for Nested loops with same variable
  • 1 for Unused local variable
  • 1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. It looks like GitHub code scanning with CodeQL is already set up for this repo, so no further action is needed 🚀. For more information, please check out our post on the GitHub blog.

Collaborator

@Jasha10 Jasha10 left a comment


This looks good. I like your idea of delegating scheduling to the launcher.

My main concern is backwards compatibility. Facebook/Meta has a pretty strong internal requirement for backwards compat, so I don't think we can merge this unless the below issues are addressed:

Comment on lines 173 to 174
# Number of parallel workers
n_jobs: int = 2

Collaborator

This is a breaking change.

Instead of changing the API for OptunaSweeperConf, what if we call ConfigStore.store twice? We can do something like store(node=OptunaSweeperConfV2, name="optuna_v2") for the new API and store(node=OptunaSweeperConf, name="optuna") for backward compatibility.
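The double registration suggested above might look roughly like this (a sketch; `OptunaSweeperConfV2` and the exact group path are assumptions based on this thread, not code from the PR):

```python
from hydra.core.config_store import ConfigStore

cs = ConfigStore.instance()
# Old API stays reachable under the original name for backward compat:
cs.store(group="hydra/sweeper", name="optuna", node=OptunaSweeperConf)
# New, sequence-based API is opt-in via hydra/sweeper=optuna_v2:
cs.store(group="hydra/sweeper", name="optuna_v2", node=OptunaSweeperConfV2)
```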

Comment on lines 33 to 49
self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
self,
job_overrides: Union[Sequence[Sequence[str]], ExperimentSequence],
initial_job_idx: int,
Collaborator

@Jasha10 Jasha10 Dec 5, 2022

This is a breaking change. Type checkers (e.g. mypy) will complain about downstream launchers (including custom launcher plugins that users have written).

Instead of changing the API of Launcher.launch, what if we define a new method Launcher.launch_experiment_sequence? We can provide a default implementation that raises NotImplementedError.
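A minimal sketch of that non-breaking extension point, using the method name from this comment (the real `Launcher` base class carries more machinery than shown here):

```python
from typing import Any, List, Sequence


class Launcher:
    """Stripped-down stand-in for hydra's Launcher plugin base class."""

    def launch(
        self, job_overrides: Sequence[Sequence[str]], initial_job_idx: int
    ) -> List[Any]:
        raise NotImplementedError

    def launch_experiment_sequence(
        self, job_overrides: Any, initial_job_idx: int
    ) -> List[Any]:
        # Default implementation: existing launcher plugins keep
        # type-checking against the unchanged launch() signature, and
        # only launchers that opt in to ExperimentSequence override this.
        raise NotImplementedError(
            f"{type(self).__name__} does not support ExperimentSequence"
        )
```

Existing plugins remain valid subclasses without any change; only launchers that support the new feature override the second method.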

@Zhylkaaa Zhylkaaa force-pushed the add_experiment_sequence branch from 162249f to f59ec7b Compare December 8, 2022 05:59
@lgtm-com
Contributor

lgtm-com bot commented Dec 8, 2022

This pull request introduces 14 alerts when merging f59ec7b into c48ef19 - view on LGTM.com

new alerts:

  • 7 for Unused import
  • 4 for Nested loops with same variable
  • 2 for Module is imported with 'import' and 'import from'
  • 1 for Unused local variable


@Zhylkaaa
Author

Zhylkaaa commented Dec 8, 2022

Hi @Jasha10, I've edited this PR according to what we talked about and it seems to work. The only issues are the ax sweeper and the aws launcher; I can't figure out what the issue is and whether I caused it.
I also wanted to ask whether this looks the way you envisioned it. If so, I will do the small refactor and clean up the cosmetic issues.

@Jasha10
Collaborator

Jasha10 commented Dec 13, 2022

Thanks @Zhylkaaa. I'll take a look shortly.

Collaborator

@Jasha10 Jasha10 left a comment


Only issue is ax sweeper and aws launcher. I can't figure out that is the issue and if it's me who caused it?

No, this is not your fault. The ax sweeper is failing on the main branch too.

@@ -193,6 +193,47 @@ def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None:
assert returns["best_value"] <= 2.27


@mark.parametrize("with_commandline", (True, False))
def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None:
Collaborator


Suggested change
def test_optuna_example(with_commandline: bool, tmpdir: Path) -> None:
def test_optuna_v2_example(with_commandline: bool, tmpdir: Path) -> None:

This prevents name-collision with the other test_optuna_example function above.

Comment on lines 201 to 203
"example/sphere_sequence.py",
"--multirun",
"hydra.sweep.dir=" + str(tmpdir),
Collaborator

Suggested change
"example/sphere_sequence.py",
"--multirun",
"hydra.sweep.dir=" + str(tmpdir),
"example/sphere_sequence.py",
"--multirun",
"hydra/sweeper=optuna_v2",
"hydra.sweep.dir=" + str(tmpdir),

Adding the override hydra/sweeper=optuna_v2 makes sure the new OptunaSweeperConfV2 gets used.

@Zhylkaaa
Author

Also, sorry @Jasha10 for not bringing this up earlier, but in optuna_v2 we actually change the way max_failure_rate works. Because the notion of a batch is removed, we treat max_failure_rate as a global fraction of failed runs, in the sense that out of n_trials, floor(n_trials * max_failure_rate) runs can fail without raising an error.
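Under these semantics the failure budget is global rather than per batch. A minimal sketch (the helper name is hypothetical, not code from the PR):

```python
import math


def allowed_failures(n_trials: int, max_failure_rate: float) -> int:
    # Out of n_trials runs, this many may fail before the sweep
    # itself is treated as failed.
    return math.floor(n_trials * max_failure_rate)
```

For example, 100 trials with max_failure_rate=0.1 tolerate 10 failed runs before the sweep errors out.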

@Zhylkaaa Zhylkaaa requested a review from Jasha10 December 17, 2022 10:05
@Jasha10
Collaborator

Jasha10 commented Dec 21, 2022

Thanks @Zhylkaaa. I'll review this shortly.

Collaborator

@Jasha10 Jasha10 left a comment


Sorry I've been slow on this.

Comment on lines +1 to 15
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
Collaborator

@Jasha10 Jasha10 Dec 18, 2022

Suggested change
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved

No need to add the license to sweeper.py since sweeper.py is not otherwise modified.

Author

let me check, maybe I have added some changes and haven't committed them.

Author

yes, I originally had ExperimentSequence in the sweeper file and forgot to remove the license, sorry. I think this is otherwise good to go (except we could refactor the multiprocessing launcher, but that would take too much time, so I think it will be a follow-up PR).

@@ -2,7 +2,7 @@
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Sequence
from typing import List, Optional, Sequence, Union
Collaborator

Suggested change
from typing import List, Optional, Sequence, Union
from typing import List, Optional, Sequence

@omry
Collaborator

omry commented Jan 26, 2023

Can you explain the motivation?
"This PR moves logic of batching and creating jobs to launcher, so resources can be utilized better. Boosts GPU utilization significantly."

Can you explain why moving the batching logic to the launchers as opposed to implementing it in the sweepers boosts GPU utilization?

@Zhylkaaa
Author

hey @omry, I would like to illustrate it with an example:
suppose you are running hyperparameter optimization for DL models and your sweep config contains architectural parameters like the number of layers and/or the hidden layer size.
Hidden layer size influences the evaluation time of the objective, because for DL models we typically have to perform (batch_size, hidden_size) x (hidden_size, hidden_size) matrix multiplications with ~O(hidden_size^2 * batch_size) complexity.

Everything is fine when the hidden_size values are close to each other, but for wide sweeps (with many different values to search through) we might quite possibly end up with a batch of, say, 8 jobs, 7 of which finish within an hour (because of similar sizes or early stopping) while one job runs for another 3 hours. The current implementation will simply wait for those 3 hours before returning the batch of jobs to the optuna study and drawing the next batch, which leaves 7 executors idle waiting for 1 (in fact we saw >25% idle time for 24+ hour experiments). In the case of GPUs this is a big waste, and that idle time costs a lot in infrastructure (an hour of an AWS 8xV100 instance costs ~$18) and in overall experiment duration (consider the man-hours).

One possible way to solve this is to overdraw experiments (instead of sampling 8 jobs, sample 16) and hope that this smooths things out a bit. This is a viable approach, but considering that 8 points are sometimes sufficient to discard a relatively big region of the search space, and that the next 8 evaluated points are then quite possibly outdated, I would call this approach sub-optimal.

So we propose this solution, which introduces the ExperimentSequence object that serves as a proxy between the study and the launcher. It enables users to customize runs depending on the current infrastructure (assign GPUs to jobs, for example) and enables launchers to start experiments asynchronously and report results as they arrive, so the study can draw more meaningful samples. In the case of the joblib launcher it won't make any difference, but for new and a few existing launchers this reduces overall experiment time significantly and utilizes resources better.
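A toy version of that proxy might look like this (a sketch based on the description above; the method names are assumptions, not necessarily the PR's exact interface):

```python
from typing import Any, Iterator, List, Sequence, Tuple


class ExperimentSequence:
    """Sketch of a sweeper/launcher proxy (hypothetical interface).

    The launcher iterates over it to obtain override lists, and calls
    update() with results as jobs finish, so the sweeper can sample
    the next trial with that feedback already taken into account.
    """

    def __init__(self, candidates: List[Sequence[str]]) -> None:
        self._candidates = list(candidates)
        self.results: List[Tuple[Sequence[str], Any]] = []

    def __iter__(self) -> Iterator[Sequence[str]]:
        return self

    def __next__(self) -> Sequence[str]:
        # A real implementation would draw the next trial lazily from
        # the study instead of popping from a pre-built list.
        if not self._candidates:
            raise StopIteration
        return self._candidates.pop(0)

    def update(self, overrides: Sequence[str], result: Any) -> None:
        # A real implementation would feed this back into the study
        # before the next sample is drawn.
        self.results.append((overrides, result))
```

A launcher can then pull a replacement job from the sequence the moment any worker frees up, instead of waiting for a whole batch to finish.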

@omry
Collaborator

omry commented Jan 27, 2023

The Optuna Sweeper is really expecting the launcher to be asynchronous.
In principle, defining an interface for async launching support feels like a more productive course of action here.

Something along these lines:

class AsyncLauncher(Plugin):
    def submit(self, job_overrides: Sequence[str]) -> int: ...  # returns a job id
    def wait(self, job_id: int) -> JobReturn: ...  # "await" is a reserved word in Python
    def cancel(self, job_id: int) -> None: ...
    def wait_all(self) -> None: ...

Synchronous launchers could be implemented in terms of asynchronous operations.
I am personally no longer involved with Hydra, but I am willing to review such a diff and help get it landed.
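omry's remark that synchronous launchers could be implemented in terms of asynchronous operations can be sketched with a toy launcher (all names here are hypothetical stand-ins, not hydra code):

```python
from typing import Dict, List, Sequence


class FakeAsyncLauncher:
    """Toy stand-in for the async launcher interface sketched above."""

    def __init__(self) -> None:
        self._jobs: Dict[int, Sequence[str]] = {}

    def submit(self, job_overrides: Sequence[str]) -> int:
        # Start the job and return a handle immediately.
        job_id = len(self._jobs)
        self._jobs[job_id] = job_overrides
        return job_id

    def wait(self, job_id: int) -> Sequence[str]:
        # A real launcher would block until the job finishes and return
        # a JobReturn; here we just echo the overrides back.
        return self._jobs[job_id]


def launch_batch(
    launcher: FakeAsyncLauncher, batch: Sequence[Sequence[str]]
) -> List[Sequence[str]]:
    # A synchronous, batched launch expressed purely in terms of the
    # async primitives: submit everything, then wait in submission order.
    job_ids = [launcher.submit(overrides) for overrides in batch]
    return [launcher.wait(job_id) for job_id in job_ids]
```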

@Zhylkaaa
Author

Zhylkaaa commented Jan 27, 2023

Thanks for the feedback @omry.
So, you propose to make launchers asynchronous and move all the scheduling and awaiting logic to the sweeper? We were actually trying to avoid this, to separate sweeper and launcher functionality as much as possible.
I think this removes the space for user customization (e.g. inheriting from ExperimentSequence and keeping track of GPUs)?
I am not sure how much time it would take to rewrite it all again :)

EDIT:
@Jasha10 do you have any thoughts?

@jbaczek
Contributor

jbaczek commented Feb 8, 2023

Hi @omry,
The problem we wanted to solve is to properly encapsulate the mechanisms of sweeping and launching. The Optuna sweeper requires feedback from the jobs to schedule the next experiments (it performs TPE optimization), so it produces batches of experiments. Launchers in their current form consume a batch, launch the experiments, and return the batch of results.

As @Zhylkaaa mentioned, a batch can be extremely uneven, leading to wasted resources. The proposed solution is meant to solve this feedback-loop problem for uneven batches. We don't expect a GPU ordinal to influence the final accuracy of a model, so we don't think that scheduling should take place in the sweeper. Nor should we expect sweeper developers to take the variety of different system architectures into consideration while developing these plugins.
We also don't want to defer scheduling solely to launchers, because that leads to wasted resources.

So this PR is meant to provide a further abstraction for the scheduling/feedback loop, which we believe should be a layer between the sweeper and the launcher. Launchers are asynchronous right now and we don't want to mess with them too much. We discussed this approach with @Jasha10 and came to the conclusion that our solution is decent enough. We probably won't have any more time to start on this from the ground up.

@omry
Collaborator

omry commented Feb 10, 2023

  1. I understand that you are looking for the simplest solution to your problem. However, introducing this solution would make subsequent improvements more difficult (the fact that you treat the Sequence as a kind of extension point makes this more obvious). As I am no longer actively working on this project, I am not in a position to accept or reject this PR. In my opinion it's not a great idea, because it's not a complete solution and it will make subsequent fixes harder.

  2. Can you tell me how this solution works when the workers are in a different process, or even on a different machine, than the sweeping process, hidden behind a particular Launcher implementation (for example, running on AWS instances via the Ray Launcher)?

@Zhylkaaa
Author

Zhylkaaa commented Feb 10, 2023

Hi @omry,

  1. Well, I would argue that it took some time to arrive at this solution... and I am interested in how you envision the complete solution?
  2. In exactly the same way as before, because the current hydra implementation of the Ray (AWS) launcher is inherently batched and I don't see an easy way to decouple it enough to report results as they arrive. But I am sure there is a way, if someone is willing to incorporate the Sequence behavior into the Ray launcher (maybe @Jasha10 has some insight into how it works and can comment on that). It's much easier to explain for the submitit launcher, which also submits jobs to other nodes on the cluster.
     The current implementation does something like this:
     return [j.results()[0] for j in jobs]  # the results() call is blocking: we wait for the whole batch of jobs
     Now if we decouple that into:
unfinished_jobs = jobs
while unfinished_jobs:
    finished_jobs, unfinished_jobs = wait_for_first(unfinished_jobs)
    experiment_sequence.update([finished_job.results() for finished_job in finished_jobs])

^ This will report results immediately as they come. You can also launch new jobs with configurations sampled from the updated study (taking the new results into account) by additionally writing something like:

unfinished_jobs = jobs
while unfinished_jobs:
    finished_jobs, unfinished_jobs = wait_for_first(unfinished_jobs)
    experiment_sequence.update([finished_job.results() for finished_job in finished_jobs])
    for next_job_config, _ in zip(experiment_sequence, range(batch_size - len(unfinished_jobs))):
        job = _launch_job(next_job_config)
        unfinished_jobs.append(job)

Probably another added benefit is that you can write a custom ExperimentSequence class and tailor the slurm job config so as not to over-allocate resources and to utilize cluster nodes better (this also influences how fast your tasks get scheduled). At least I think this is possible on the job-config side. (@Jasha10, please correct me, because I have never worked with submitit, only CLI sbatch.)

We can add this feature to the launchers we know how to update, but we need some kind of reassurance that this effort is worth something.
If you want to rebuild the whole launcher+sweeper paradigm that exists now, I am afraid we cannot help you with that.
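The decoupled loop discussed above can be demonstrated with the standard library. This sketch uses `concurrent.futures` in place of submitit (which exposes an equivalent blocking wait); the names are illustrative, not the PR's code:

```python
import concurrent.futures as cf


def run_as_completed(fn, configs, max_workers=2):
    """Launch all configs and collect each result as soon as it finishes."""
    results = []
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(fn, cfg): cfg for cfg in configs}
        while pending:
            # Block only until the first job finishes, not the whole batch.
            done, _ = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for future in done:
                cfg = pending.pop(future)
                # This is the point where a launcher would call
                # experiment_sequence.update(...) and could submit a
                # freshly sampled job to keep every worker busy.
                results.append((cfg, future.result()))
    return results
```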

@Zhylkaaa
Author

Hi @Jasha10 @omry,
I would like to get back to the question of asynchronous runs: I have changed labs, but the problem of wasted time remains. Would you be interested in discussing this feature and possible solutions?
I am currently thinking about a winter project and this seems like a very good candidate; please let me know what you think.
I remember that there were different views on how this issue should be solved, and I don't really see a more backwards-compatible way than the one proposed in this PR: introducing an additional abstraction over trial suggestion, so that it appears to the launcher as just a sequence of experiments, instead of managing launcher logic in the sweeper.

I was also considering a major refactor of the multiprocessing launcher, but I am not sure that makes sense.
Best regards

@Jasha10
Collaborator

Jasha10 commented Nov 30, 2023

Hi @Zhylkaaa, I'm no longer working at Meta -- Sorry to say that I don't have the bandwidth to give this feature the attention that it deserves.

@Zhylkaaa
Author

Zhylkaaa commented Dec 1, 2023

Hi @Jasha10, I am sorry to hear that. Is there any chance we can get back to it in the foreseeable future?
I will maintain my fork of hydra with this functionality and try to keep it up to date as long as I can.
In terms of functionality left to implement, I think there are two sweepers left, as well as the Ray and Slurm launchers. I will try to work on the launchers in a month, because that should be relatively easy, and then start on the sweepers as well.
Can you give your opinion on the general idea of the ExperimentSequence abstraction? If the abstraction itself is acceptable, we can work out details and exact implementations later.

Thank you for your time.
Best regards,
Dima

@Jasha10
Collaborator

Jasha10 commented Dec 5, 2023

Can you give your opinion about general idea on ExperimentSequence abstraction.

I seem to recall feeling that the abstraction was acceptable last time I looked at this PR. That being said, I do not completely understand the tradeoffs around @omry's AsyncLauncher idea. I think his most recent comment is suggesting that AsyncLauncher would be harder to implement later if ExperimentSequence is introduced.

You said earlier:

So, you propose to make launchers asynchronous and move all the scheduling and awaiting logic to the sweeper? We were actually trying to avoid this, to separate sweeper and launcher functionality as much as possible.

I will have to think about this... I am not clear at the moment about the advantages and disadvantages of the async API.


Successfully merging this pull request may close these issues.

[Feature Request] Optuna experiment stream processing