Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use tasks_per_node to split sweep across tasks #2633

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

odelalleau
Copy link
Collaborator

@odelalleau odelalleau commented Apr 4, 2023

Motivation

When running a sweep, someone may want to be able to use the same GPU for multiple jobs in a sweep. This PR makes it possible by leveraging the tasks_per_node argument (if set to 2 for instance, then 2 jobs may share the same GPU).

Discussion

This is currently a draft, open for feedback. I don't think it's actually a good idea to systematically use tasks_per_node for this, because some users may be using this setting for multiprocess jobs.

Two options could be:

  1. Add another flag split_sweep_over_tasks (default=False) to enable this behavior (my preferred solution at this time)
  2. Make it an entirely different setting (ex: jobs_group_size, default=1) so that it can be combined with multi-task jobs (would be more complex to implement: would need to spawn multiple processes from each SLURM job, instead of just relying on SLURM's tasks mechanism as implemented here)

Feedback and other ideas welcome!

The current implementation also has a small hack when we end up launching a single job => not sure if there's a better way to deal with this situation (basically I would like to force submitit to create a job array even for a single-job array).

Have you read the Contributing Guidelines on pull requests?

Yes

Test Plan

TBD

Related Issues and PRs

Fixes #2632

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 4, 2023
@odelalleau odelalleau marked this pull request as draft April 4, 2023 16:05
@odelalleau
Copy link
Collaborator Author

@Jasha10 and @jrapin what do you think?

@odelalleau odelalleau force-pushed the od/submitit-multi-task branch from e119fcd to 016dc28 Compare April 4, 2023 16:08
@soerenab
Copy link

soerenab commented Feb 29, 2024

This would be a very useful feature to have! Will this be merged into the main branch at some point?

In my particular case, slurm is configured to allocate a full node per job, where each node comes with 4 gpus. My models are quite small though and easily fit on a single gpu. Having the hydra sweeper submitting a new job (which seems to be the default at the moment) per hyperparameter value is hence very wasteful for me whereas parallelizing within a slurm job (and hence a single node) across tasks sounds exactly like the thing I am looking for.

If there is a different solution to this, I would of course also be interested in that. Thank you very much!

@odelalleau
Copy link
Collaborator Author

Will this be merged into the main branch at some point?

I wouldn't count on it -- I currently don't need it anymore and as I mentioned in the PR description, I think the current implementation may break some existing use cases. So it would require someone to re-work it a bit (for instance with one of my suggestions, but maybe there's a better way too).

Note however that I've used it successfully so you should be able to cherry-pick this commit and use it if it's helpful to you.

@chaithyagr
Copy link

This is a very useful feature. Can we try to work on this and get it merged? Anyone else is interested in this?

@chaithyagr
Copy link

Can I implement option 1) and then can we hope this can be merged to mainline?

@chaithyagr
Copy link

@Jasha10 @odelalleau What do you think? Is it possible to get this up as discussed about?

@matteobettini
Copy link

matteobettini commented Jul 3, 2024

Is tasks_per_node currently usable for anything else in the hydra plugin? It seems to me that this is the only envisionable use case for it in hydra

@chaithyagr
Copy link

Yes, I agree it seems like a specific usecase within hydra. But it is a wonderful usecase when we want to run 5-6 jobs within one node without worrying too much. In particular, my usecase is a large multirun, but with a quick arg, I can just run N number of tasks on a node (this translates to sharing GPU resources, when the indivitual tasks are not GPU intensive).

@matteobettini
Copy link

Yes I totally agree and would need the same thing. What I wanted to ask is what was the effect of tasks_per_node before this PR, as this seemed unused in hydra

@chaithyagr
Copy link

Well, to me it seems like something which cannot technically work with the hydra framework. But its a broader question whether to make sure that the tasks_per_node still reflects what SLURM defines it as pointed by @odelalleau .

@chaithyagr
Copy link

Can we try to merge this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Let multiple jobs share the same GPU when launching a sweep on SLURM with submitit
5 participants