Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up environment access in plugins #6941

Merged
merged 66 commits into from
Apr 13, 2021
Merged

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Apr 10, 2021

What does this PR do?

Fixes #6853

Fixes

  • delayed availability rank information (global, local)
  • enables torch elastic fault tolerance
  • potentially some reports of ddp hanging

Credit: Proposed by @ananthsub

With this PR:

trainer = Trainer(accelerator=...)
trainer.global_rank  # this is now returning the correct rank on clusters, not only just at trainer.fit()!

TODO:

  • Update TPU plugins

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks
Copy link

pep8speaks commented Apr 10, 2021

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-13 17:39:06 UTC

@codecov
Copy link

codecov bot commented Apr 10, 2021

Codecov Report

Merging #6941 (7d39a92) into master (80c5293) will decrease coverage by 9%.
The diff coverage is 83%.

@@           Coverage Diff            @@
##           master   #6941     +/-   ##
========================================
- Coverage      92%     83%     -9%     
========================================
  Files         194     194             
  Lines       12366   13132    +766     
========================================
- Hits        11350   10897    -453     
- Misses       1016    2235   +1219     


@abstractmethod
def global_rank(self) -> int:
""" The rank (index) of the currently running process across all nodes and devices. """
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be pass too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't make a difference. pass is only required when we have nothing under the function. here the docstring is already enough :)

Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carmocca regarding getter/setter, what's your recommendation. Doing it in this PR would touch the full interface.

Let's do it after this one

@@ -78,11 +78,11 @@ def __init__(
self._ddp_kwargs = kwargs
self._has_spawned_children = False
self.task_idx = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this?

@@ -34,6 +35,8 @@ class LightningEnvironment(ClusterEnvironment):
def __init__(self):
super().__init__()
self._master_port = None
self._global_rank: int = 0
self._world_size: Optional[int] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can a world size be None? Isn't it 1 then? Also what happens if this is None? Do we check this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we test it in tests/plugins/environments/test_lightning_environments
you are right, it could be 1 by default now.
we never see None because at trainer init we immediately let the plugin overwrite it to an int.

Comment on lines 80 to 81
# no-op, we are not allowed to change rank in SLURM
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better raise an Error here to make that more explicit?

Because if I call a function like this and it doesn't error I expect the new rank to be set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, but then we would have to make the call to the setter conditional in the training plugins.
For example, in DDPSpawnPlugin we use the setter to update the global rank once we have spawned the new processes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make add log an error here to warn users but not raise an exception?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a warning would be fine. Might be a good time to introduce several warn / log level depending on risks.

Copy link
Contributor Author

@awaelchli awaelchli Apr 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went for log.debug for now. I don't want to risk users seeing error messages pop up that they don't understand. The team wants to include this PR in in the patch release today. I could not find a smarter way to avoid the setter in the limited time.

def __init__(self):
super().__init__()
@staticmethod
def is_using_torchelastic() -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we have something similar for slurm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have, it's in the accelerator connector. and I will do a follow up. didn't want to make the PR larger

Comment on lines +37 to +46
def global_rank(self) -> int:
return hvd.rank()

@property
def local_rank(self) -> int:
return hvd.local_rank()

@property
def world_size(self) -> int:
return hvd.size()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be cleaner to also have a horovod environment plugin here? since for this part it's similar to torchelastic and should also be handled like that.

Comment on lines +98 to +99
"SLURM_PROCID": "1",
"SLURM_LOCALID": "1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Do you need to change these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these were artificial values, and they don't match with what we expect.
SLURM_NTASKS=2 but LOCAL_RANK 10 is not valid, and would actually lead to Lightning ignoring the SlurmEnvironment and instead select the LightningEnvironment (there is a comment in the code, we should document that behavior better, and write proper tests instead of these misleading ones that were here)

tests/plugins/test_common.py Outdated Show resolved Hide resolved
@@ -302,8 +302,8 @@ def root_gpu(self) -> Optional[int]:

@property
def is_using_torchelastic(self) -> bool:
te_flags_passed = "WORLD_SIZE" in os.environ and ("GROUP_RANK" in os.environ or "NODE_RANK" in os.environ)
return te_flags_passed
required_env_vars = ("RANK", "GROUP_RANK", "LOCAL_RANK")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about WORLD_SIZE ?

def world_size(self) -> int:
return self._world_size

def set_world_size(self, size: int) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need set_world_size. Can you just use a setter and make everything properties ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but this interface is old and all getters are methods from the beginning.
It would be better to have proper setters and getters. I decided not to keep backward compatibility and make the interface consistent. I propose to do a deprecation in a follow up to keep this PR managable

Comment on lines 80 to 81
# no-op, we are not allowed to change rank in SLURM
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a warning would be fine. Might be a good time to introduce several warn / log level depending on risks.

@awaelchli awaelchli merged commit 33cc9fe into master Apr 13, 2021
@awaelchli awaelchli deleted the bugfix/elastic-world-size branch April 13, 2021 18:07
@awaelchli
Copy link
Contributor Author

Great, one more headache solved. A few items from reviews that I could not address fully will be tracked here: #6303
I will make sure these concerns get sorted out. Thanks everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working design Includes a design discussion distributed Generic distributed-related topic priority: 0 High priority task
Projects
None yet
Development

Successfully merging this pull request may close these issues.

global process count incorrect with elastic, fault tolerant training
10 participants