Allow obtaining num_nodes from ClusterEnvironment #7361
Comments
Because trainer.num_nodes needs to be set up explicitly, whereas the cluster environment can provide the information automatically.
I worry about 2 sources of truth here.
That's fine. But that doesn't preclude making
I have a refactor pending for SLURM #6303. There, the behaviour is: If
For this I had planned:

class SomeClusterEnv:
    def __init__(self, num_nodes: Optional[int], ...):
        # the cluster env can decide what to do here:
        # 1) It can warn or error if the num_nodes are not set correctly
        # 2) It can infer num nodes from environment automatically
        # 3) something else (e.g. SLURM logic)

    def num_nodes(self) -> int:
        return ...

And SLURM implements the above as described. For the rest of the Lightning code base, the number of nodes is derived from the cluster environment. Let me know what you think.
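A minimal, self-contained sketch of options 1) and 2) from the comment above, not the code from #6303. The class name `SketchClusterEnv`, the `env_var` parameter, and the choice to raise on a mismatch (rather than warn) are assumptions made for illustration; only the variable name `AWS_BATCH_JOB_NUM_NODES` comes from the issue description.

```python
import os
from typing import Optional


class SketchClusterEnv:
    """Hypothetical cluster environment that owns the node count."""

    def __init__(
        self,
        num_nodes: Optional[int] = None,
        env_var: str = "AWS_BATCH_JOB_NUM_NODES",  # assumption: configurable for the sketch
    ) -> None:
        raw = os.environ.get(env_var)
        detected = int(raw) if raw is not None else None

        if num_nodes is None and detected is None:
            # Nothing to infer from and nothing given explicitly.
            raise ValueError(f"num_nodes was not given and ${env_var} is not set")

        if num_nodes is not None and detected is not None and num_nodes != detected:
            # Option 1: complain when the explicit value contradicts the cluster.
            raise ValueError(
                f"num_nodes={num_nodes} does not match ${env_var}={detected}"
            )

        # Option 2: fall back to the value inferred from the environment.
        self._num_nodes = num_nodes if num_nodes is not None else detected

    def num_nodes(self) -> int:
        return self._num_nodes
```

With this shape the explicit argument stays optional, and the environment is consulted only to fill in or validate it, which keeps a single source of truth inside the cluster environment object.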
Thank you @awaelchli. Can you elaborate more on
The idea would be to make
🚀 Feature
num_nodes must currently be specified manually by the user. However, the number of nodes is generally known in a cluster environment [1] and could be provided by, and initialized from, ClusterEnvironment.
[1] For example $AWS_BATCH_JOB_NUM_NODES in https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html
Pitch
https://github.com/PyTorchLightning/pytorch-lightning/blob/763a9a9495977b23cbd6a57f10253b662fd592a5/pytorch_lightning/plugins/training_type/ddp.py#L62-L75
could be updated to initialize num_nodes from ClusterEnvironment if a ClusterEnvironment is provided and implements a num_nodes method.
cc @Borda @awaelchli @ananthsub