
[Feature Request] Hydra ax plugin: Allow configuring max_parallelism (out of memory workaround) #1837

Open
Harimus opened this issue Sep 27, 2021 · 2 comments
Labels
enhancement (Enhancement request) · plugin (Plugin-related issues)

Comments


Harimus commented Sep 27, 2021

🚀 Feature Request

This feature request stems from a question I asked on Stack Overflow regarding a memory-leak(ish) pattern I saw in my code.

Summary of the question (problem) I had and the answer I got:

I was trying to run the hydra-ax-sweeper for hyperparameter optimization on the MineRL environment dataset. This dataset is quite big, around 30 GB+ when fully loaded into RAM, so I used multiprocessing to speed up the loading.

After about 2-3 runs the code freezes at the data-loading phase. When I check the memory usage, the system is at its maximum (90 GB RAM plus swap) and therefore can't load anything. My question was about how memory is allocated/deallocated between ax-sweeper runs. My suspicion is that it is not handled the way I initially thought (namely, I assumed that main(cfg: DictConfig) decorated with @hydra.main(...) runs to termination between sweeps, sequentially).

The answer I got was:

The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client. I suspect that your machine is running out of memory because of this parallelism.

Hydra's Ax plugin does not currently have a config group for configuring this max_parallelism setting, so it is automatically set by ax.

along with a quick workaround: move the data-loading step outside main(). This would of course be doable, but I load the data based on the config-file parameters, so it would mean moving all of that outside the entire Hydra pipeline.

Motivation

Being able to disable the parallelism between sweeper runs might make things easier for some users; I'm fairly certain that parallel execution by default is not what most people imagine is happening.

Additional context

I'm pretty new to how Ax works, so I might be misunderstanding things here; if that's the case, sorry for the noise.

Harimus added the enhancement label on Sep 27, 2021
jieru-hu (Contributor) commented

Thanks @Harimus.

It seems that if we add choose_generation_strategy_kwargs to the Ax config, we should be able to configure max_parallelism.

Would you be willing to try something like this locally and see if it helps? When I ran this locally, Ax launched one job at a time.

diff --git a/plugins/hydra_ax_sweeper/example/conf/config.yaml b/plugins/hydra_ax_sweeper/example/conf/config.yaml
index 4bbd69bfb..a59d6c03d 100644
--- a/plugins/hydra_ax_sweeper/example/conf/config.yaml
+++ b/plugins/hydra_ax_sweeper/example/conf/config.yaml
@@ -19,6 +19,8 @@ hydra:
       experiment:
         # Default to minimize, set to false to maximize
         minimize: true
+        choose_generation_strategy_kwargs:
+           max_parallelism_cap: 1
 
       early_stop:
         # Number of epochs without a significant improvement from
diff --git a/plugins/hydra_ax_sweeper/hydra_plugins/hydra_ax_sweeper/config.py b/plugins/hydra_ax_sweeper/hydra_plugins/hydra_ax_sweeper/config.py
index 17eb0d756..dbc3eb44c 100644
--- a/plugins/hydra_ax_sweeper/hydra_plugins/hydra_ax_sweeper/config.py
+++ b/plugins/hydra_ax_sweeper/hydra_plugins/hydra_ax_sweeper/config.py
@@ -30,6 +30,7 @@ class ExperimentConfig:
     parameter_constraints: Optional[List[str]] = None
     outcome_constraints: Optional[List[str]] = None
     status_quo: Optional[Dict[str, Any]] = None
+    choose_generation_strategy_kwargs: Optional[Dict[str, Any]] = None
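Applied to a user's own config, the resulting sweeper section would look roughly like this. This is a sketch based on the diff above; the surrounding keys are taken from the plugin's example config, and max_parallelism_cap: 1 is simply the most conservative value (one trial at a time):

```yaml
hydra:
  sweeper:
    ax_config:
      experiment:
        # Default to minimize, set to false to maximize
        minimize: true
        # Forwarded to Ax's choose_generation_strategy; a cap of 1
        # should force trials to run sequentially.
        choose_generation_strategy_kwargs:
          max_parallelism_cap: 1
```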

jieru-hu added the plugin label on Sep 27, 2021

Harimus commented Dec 4, 2021

Hi @jieru-hu, sorry for the late reply. The project I was working on ended (abruptly).

But I hit a similar(?) issue, probably related to parallelism in the ax-sweeper, and I tried the changes above to no avail.

Maybe I should open a new issue for that problem, but in short: inside the @hydra.main() run I was also using multiprocessing.Pool, and somehow all of the subprocesses froze at some arbitrary data_as_tensor = torch.as_tensor(numpy_data) call (one torch.as_tensor() call worked; the second call just froze).
The same problem went away if I either used the basic sweeper (no parallelism) or changed multiprocessing.Pool to multiprocessing.pool.ThreadPool (no process-level parallelism) while using the ax-sweeper. I tried the original code (using the multiprocessing library) with the solution you mentioned, but the freezing still happened.
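For anyone hitting the same freeze, the swap described above is close to a one-line change, because multiprocessing.pool.ThreadPool deliberately mirrors the Pool API. A minimal sketch, with a dummy load_chunk function standing in for the real torch.as_tensor conversion step:

```python
from multiprocessing.pool import ThreadPool  # same map()/imap() API as multiprocessing.Pool


def load_chunk(chunk):
    # Placeholder for the real per-chunk work (e.g. numpy -> tensor conversion).
    return sum(chunk)


chunks = [[1, 2], [3, 4], [5, 6]]

# multiprocessing.Pool forks worker processes; combined with another layer of
# parallelism (such as the ax-sweeper's), the forked children can deadlock.
# ThreadPool keeps all workers inside one process, avoiding the fork entirely.
with ThreadPool(processes=4) as pool:
    results = pool.map(load_chunk, chunks)

print(results)  # [3, 7, 11]
```

Whether threads are an acceptable substitute depends on the workload: the GIL serializes pure-Python code, but data loading that spends its time in I/O or in native code (numpy, torch) can still overlap usefully.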

It took me a while to get back to this, but thank you for the prompt answer!
(For now I'll just move my code to ThreadPool, so I guess this issue can be closed.)
