
[Feature Request] Hydra ax plugin: Allow configuring max_parallelism (out of memory workaround) #1837

Open
Harimus opened this issue Sep 27, 2021 · 2 comments
Labels
enhancement (Enhancement request) · plugin (Plugin-related issues)

Comments


Harimus commented Sep 27, 2021

🚀 Feature Request

This feature request stems from a question I asked on Stack Overflow regarding a memory-leak(ish) pattern I saw in my code.

Summary of the question (problem) I had and the answer I got:

I was trying to run the hydra-ax-sweeper for hyperparameter optimization on the MineRL environment dataset. This dataset is quite big, around 30 GB+ when fully loaded into RAM, so I used multiprocessing to speed up the loading.

After about 2-3 runs the code freezes at the data-loading phase. When I check the memory usage, the system is at its maximum (90 GB RAM plus swap) and therefore can't load anything. My question was about how memory is allocated/deallocated between ax-sweeper runs. My suspicion is that it is not handled the way I initially thought (namely, I assumed that main(cfg: DictConfig) decorated with @hydra.main(...) runs to termination between sweeps, sequentially).

The answer I got was:

The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client. I suspect that your machine is running out of memory because of this parallelism.

Hydra's Ax plugin does not currently have a config group for configuring this max_parallelism setting, so it is automatically set by ax.

along with a quick workaround: move the data-loading step outside main(). This would of course be doable, but I load the data based on the config-file parameters, so it would mean moving all of that outside the entire Hydra pipeline.

Motivation

Being able to disable the parallelism between sweeper runs might make things easier for some users; I'm fairly certain that parallel execution by default is not what most people imagine is happening.

Additional context

I'm pretty new to how Ax works, so I might be misunderstanding things here; if that's the case, sorry for the noise.

Harimus added the enhancement label on Sep 27, 2021
jieru-hu (Contributor) commented

Thanks @Harimus.

It seems that if we add choose_generation_strategy_kwargs to the Ax config, we should be able to configure max_parallelism.

Would you be willing to try something like this locally and see if it helps? When I ran this locally, Ax launched one job at a time.

diff --git a/plugins/hydra_ax_sweeper/example/conf/config.yaml b/plugins/hydra_ax_sweeper/example/conf/config.yaml
index 4bbd69bfb..a59d6c03d 100644
--- a/plugins/hydra_ax_sweeper/example/conf/config.yaml
+++ b/plugins/hydra_ax_sweeper/example/conf/config.yaml
@@ -19,6 +19,8 @@ hydra:
       experiment:
         # Default to minimize, set to false to maximize
         minimize: true
+        choose_generation_strategy_kwargs:
+           max_parallelism_cap: 1
 
       early_stop:
         # Number of epochs without a significant improvement from
diff --git a/plugins/hydra_ax_sweeper/hydra_plugins/hydra_ax_sweeper/config.py b/plugins/hydra_ax_sweeper/hydra_plugins/hydra_ax_sweeper/config.py
index 17eb0d756..dbc3eb44c 100644
--- a/plugins/hydra_ax_sweeper/hydra_plugins/hydra_ax_sweeper/config.py
+++ b/plugins/hydra_ax_sweeper/hydra_plugins/hydra_ax_sweeper/config.py
@@ -30,6 +30,7 @@ class ExperimentConfig:
     parameter_constraints: Optional[List[str]] = None
     outcome_constraints: Optional[List[str]] = None
     status_quo: Optional[Dict[str, Any]] = None
+    choose_generation_strategy_kwargs: Optional[Dict[str, Any]] = None
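Applied to a user's own config, the resulting sweeper section would look roughly like this. This is a sketch based on the diff above; the surrounding keys are taken from the plugin's example config, and max_parallelism_cap: 1 is simply the most conservative value (one trial at a time):

```yaml
hydra:
  sweeper:
    ax_config:
      experiment:
        # Default to minimize, set to false to maximize
        minimize: true
        # Forwarded to Ax's choose_generation_strategy; a cap of 1
        # should force trials to run sequentially.
        choose_generation_strategy_kwargs:
          max_parallelism_cap: 1
```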

jieru-hu added the plugin label on Sep 27, 2021

Harimus commented Dec 4, 2021

Hi @jieru-hu, sorry for the late reply. The project I was working on ended (abruptly).

But I hit a similar(?) issue, probably related to parallelism in the ax-sweeper, and I tried the changes above to no avail.

Maybe I should open a new issue for that problem, but in short: inside the @hydra.main() run I was also using multiprocessing.Pool, and somehow all of the subprocesses froze at some arbitrary data_as_tensor = torch.as_tensor(numpy_data) call (one torch.as_tensor() call worked; the second call just froze).
The same problem went away if I either used the basic sweeper (no parallelism) or changed multiprocessing.Pool to multiprocessing.pool.ThreadPool (no process-level parallelism) while using the ax-sweeper. I tried the original code (using the multiprocessing library) with the solution you mentioned, but the freezing still happened.
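For anyone hitting the same freeze, the swap described above is close to a one-line change, because multiprocessing.pool.ThreadPool deliberately mirrors the Pool API. A minimal sketch, with a dummy load_chunk function standing in for the real torch.as_tensor conversion step:

```python
from multiprocessing.pool import ThreadPool  # same map()/imap() API as multiprocessing.Pool


def load_chunk(chunk):
    # Placeholder for the real per-chunk work (e.g. numpy -> tensor conversion).
    return sum(chunk)


chunks = [[1, 2], [3, 4], [5, 6]]

# multiprocessing.Pool forks worker processes; combined with another layer of
# parallelism (such as the ax-sweeper's), the forked children can deadlock.
# ThreadPool keeps all workers inside one process, avoiding the fork entirely.
with ThreadPool(processes=4) as pool:
    results = pool.map(load_chunk, chunks)

print(results)  # [3, 7, 11]
```

Whether threads are an acceptable substitute depends on the workload: the GIL serializes pure-Python code, but data loading that spends its time in I/O or in native code (numpy, torch) can still overlap usefully.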

It took me a while to get back to this, but thank you for the prompt answer!
(For now I'll just move my code to ThreadPool, so I guess this issue can be closed.)
