Use `tasks_per_node` to split sweep across tasks #2633

odelalleau · 2023-04-04T16:04:56Z

Motivation

When running a sweep, a user may want several jobs of the sweep to share the same GPU. This PR makes that possible by leveraging the `tasks_per_node` argument (if set to 2, for instance, then 2 jobs may share the same GPU).
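To illustrate the idea, here is a minimal sketch of how sweep jobs could be grouped so that each SLURM job carries up to `tasks_per_node` of them. It is not the code from this PR: the helper name `group_sweep_jobs` and the representation of a sweep as lists of override strings are assumptions made for the example.

```python
from typing import List, Sequence


def group_sweep_jobs(
    overrides: Sequence[Sequence[str]], tasks_per_node: int
) -> List[List[Sequence[str]]]:
    """Chunk sweep jobs so each SLURM job runs up to `tasks_per_node` of them.

    Hypothetical helper for illustration only, not part of Hydra or submitit.
    """
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be >= 1")
    return [
        list(overrides[i : i + tasks_per_node])
        for i in range(0, len(overrides), tasks_per_node)
    ]


# A 3-job sweep with tasks_per_node=2: the first SLURM job holds two sweep
# jobs, the second holds the remaining one.
sweep = [["lr=0.1"], ["lr=0.01"], ["lr=0.001"]]
groups = group_sweep_jobs(sweep, tasks_per_node=2)
for slurm_job_index, group in enumerate(groups):
    print(slurm_job_index, group)
```

With `tasks_per_node=1` every sweep job lands in its own SLURM job, which matches the behavior without this change.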

Discussion

This is currently a draft, open for feedback. I don't think it's actually a good idea to systematically use tasks_per_node for this, because some users may be using this setting for multiprocess jobs.

Two options could be:

1. Add another flag split_sweep_over_tasks (default=False) to enable this behavior (my preferred solution at this time)
2. Make it an entirely different setting (ex: jobs_group_size, default=1) so that it can be combined with multi-task jobs (would be more complex to implement: would need to spawn multiple processes from each SLURM job, instead of just relying on SLURM's tasks mechanism as implemented here)
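A rough sketch of what the first option above could look like from inside a task, assuming each task finds its rank in SLURM's `SLURM_PROCID` environment variable. The function name `pick_overrides_for_task` and its signature are hypothetical, and the PR itself may select the job differently (for example through submitit's job environment).

```python
import os
from typing import Mapping, Optional, Sequence


def pick_overrides_for_task(
    group: Sequence[Sequence[str]],
    split_sweep_over_tasks: bool = False,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[Sequence[str]]:
    """Return the sweep job this task should run, or None if it has nothing to do.

    Hypothetical helper: `group` is the list of sweep jobs assigned to one
    SLURM job, and `split_sweep_over_tasks` is the opt-in flag proposed above.
    """
    env = os.environ if environ is None else environ
    if not split_sweep_over_tasks:
        # Flag off: keep the existing meaning of tasks_per_node, where all
        # tasks cooperate on the same (single) sweep job.
        return group[0]
    # Flag on: each task picks a different sweep job based on its rank.
    rank = int(env.get("SLURM_PROCID", "0"))
    # The last group of a sweep may hold fewer jobs than there are tasks.
    return group[rank] if rank < len(group) else None


group = [["lr=0.1"], ["lr=0.01"]]
print(pick_overrides_for_task(group, True, {"SLURM_PROCID": "1"}))
```

Keeping the flag off by default leaves multiprocess jobs that already rely on `tasks_per_node` unaffected.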

Feedback and other ideas welcome!

The current implementation also has a small hack for the case where we end up launching a single job. I'm not sure if there's a better way to deal with this situation (basically, I would like to force submitit to create a job array even when it contains a single job).

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

TBD

Related Issues and PRs

Fixes #2632

odelalleau · 2023-04-04T16:05:38Z

@Jasha10 and @jrapin what do you think?

soerenab · 2024-02-29T18:45:31Z

This would be a very useful feature to have! Will this be merged into the main branch at some point?

In my particular case, SLURM is configured to allocate a full node per job, and each node comes with 4 GPUs. My models are quite small, though, and easily fit on a single GPU. Having the Hydra sweeper submit a new job per hyperparameter value (which seems to be the default at the moment) is hence very wasteful for me, whereas parallelizing across tasks within a SLURM job (and hence a single node) sounds exactly like what I am looking for.

If there is a different solution to this, I would of course also be interested in that. Thank you very much!

odelalleau · 2024-03-01T01:06:42Z

> Will this be merged into the main branch at some point?

I wouldn't count on it -- I currently don't need it anymore and as I mentioned in the PR description, I think the current implementation may break some existing use cases. So it would require someone to re-work it a bit (for instance with one of my suggestions, but maybe there's a better way too).

Note however that I've used it successfully so you should be able to cherry-pick this commit and use it if it's helpful to you.

chaithyagr · 2024-05-27T09:49:52Z

This is a very useful feature. Can we try to work on this and get it merged? Is anyone else interested in this?

chaithyagr · 2024-05-27T09:57:07Z

Can I implement option 1), and if so, can we hope to get it merged into mainline?

chaithyagr · 2024-07-03T11:43:00Z

@Jasha10 @odelalleau What do you think? Is it possible to get this merged as discussed above?

matteobettini · 2024-07-03T13:36:29Z

Is tasks_per_node currently usable for anything else in the Hydra plugin? It seems to me that this is the only conceivable use case for it in Hydra.

chaithyagr · 2024-07-03T13:44:09Z

Yes, I agree it seems like a specific use case within Hydra. But it is a wonderful use case when we want to run 5-6 jobs within one node without worrying too much. In particular, my use case is a large multirun where, with a quick argument, I can just run N tasks on a node (this translates to sharing GPU resources when the individual tasks are not GPU intensive).

matteobettini · 2024-07-03T13:47:53Z

Yes, I totally agree and would need the same thing. What I wanted to ask is what the effect of tasks_per_node was before this PR, as it seemed unused in Hydra.

chaithyagr · 2024-07-03T13:58:48Z

Well, to me it seems like something that cannot technically work within the Hydra framework. But it's a broader question whether tasks_per_node should still reflect what SLURM defines it as, as pointed out by @odelalleau.

chaithyagr · 2024-11-15T11:32:41Z

Can we try to merge this?

facebook-github-bot added the CLA Signed label Apr 4, 2023

odelalleau marked this pull request as draft April 4, 2023 16:05

Use tasks_per_node to split sweep across tasks

016dc28

odelalleau force-pushed the od/submitit-multi-task branch from e119fcd to 016dc28 Compare April 4, 2023 16:08
