Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow obtaining num_nodes from ClusterEnvironment #7361

Open
leezu opened this issue May 4, 2021 · 7 comments
Open

Allow obtaining num_nodes from ClusterEnvironment #7361

leezu opened this issue May 4, 2021 · 7 comments
Labels
environment feature Is an improvement or enhancement good first issue Good for newcomers help wanted Open to be worked on
Milestone

Comments

@leezu
Copy link
Contributor

leezu commented May 4, 2021

🚀 Feature

num_nodes must currently be specified manually by the user. However, the number of nodes is generally known in a cluster environment [1] and could be provided by and initialized from ClusterEnvironment

[1] For example $AWS_BATCH_JOB_NUM_NODES in https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html

Pitch

https://github.com/PyTorchLightning/pytorch-lightning/blob/763a9a9495977b23cbd6a57f10253b662fd592a5/pytorch_lightning/plugins/training_type/ddp.py#L62-L75

could be updated to initialize num_nodes from ClusterEnvironment if ClusterEnvironment is provided and implements a num_nodes method.

cc @Borda @awaelchli @ananthsub

@leezu leezu added feature Is an improvement or enhancement help wanted Open to be worked on labels May 4, 2021
@ananthsub
Copy link
Contributor

ananthsub commented May 5, 2021

@leezu Could you describe why num_nodes being passed through the trainer isn't preferable vs setting it on the cluster environment? Using the argument from the trainer is the direction #7026 is going in

@leezu
Copy link
Contributor Author

leezu commented May 5, 2021

Because trainer.num_nodes needs to be setup explicitly, whereas cluster environment can provide the information automatically.
I don't think there's a setting where setting trainer.num_nodes to a different value than the true number of nodes provided by the ClusterEnvironment would be meaningful? Possibly trainer.num_nodes can be inferred from the ClusterEnvironment instead of doing the inference inside the training type plugin?

@ananthsub
Copy link
Contributor

I worry about 2 sources of truth here. num_nodes on the trainer works regardless of the cluster environment used. Only specific training type plugins currently require the cluster environment to be set, so I personally prefer keeping the trainer as the source of truth vs having overrides and also setting it on the cluster environment. @awaelchli what do you think?

@leezu
Copy link
Contributor Author

leezu commented May 5, 2021

so I personally prefer keeping the trainer as the source of truth vs having overrides and also setting it on the cluster environment

That's fine. But that doesn't preclude making num_nodes in the trainer optional and inferring it from the cluster environment if it wasn't specified explicitly. Probably a warning should also be raised if compute environment and trainer disagree on the number of nodes.

@awaelchli
Copy link
Contributor

awaelchli commented May 5, 2021

I have a refactor pending for SLURM #6303 . There, the behaviour is: If num_nodes doesn't match the environment, it goes into interactive mode. otherwise it launches batch mode.

For this I had planned:

class SomeClusterEnv:

def __init__(self, num_nodes: Optional, ...):
    # the cluster env can decide what to do here: 
    # 1) It can warn or error if the num_nodes are not set correctly 
    # 2) It can infer num nodes from environment automatically
    # 3) something else (e.g. SLURM logic)

def num_nodes() -> int:
    return ...

And SLURM implements the above as described.

For the remaining code base of Lightning, the num nodes get derived from Cluster env.
This will address also the concern of @ananthsub regarding from how many places the information comes from.

Let me know what you think.

@awaelchli awaelchli self-assigned this May 5, 2021
@leezu
Copy link
Contributor Author

leezu commented May 5, 2021

Thank you @awaelchli. Can you elaborate more on 2) It can infer num nodes from environment automatically? Is this expected to make trainer.num_nodes optional and infer from environment if it wasn't specified? Or is this local to the cluster environment object?

@awaelchli
Copy link
Contributor

awaelchli commented May 6, 2021

The idea would be to make Trainer(num_nodes=...) optional depending on the plugin yes, and then trainer.num_nodes will always be what the Environment determines. Not sure what that means for SLURM interactive mode but should be fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
environment feature Is an improvement or enhancement good first issue Good for newcomers help wanted Open to be worked on
Projects
None yet
Development

No branches or pull requests

5 participants