Fix ddp tests + .test() #2512
Conversation
Hello @williamFalcon! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-07-07 16:09:33 UTC
Codecov Report

@@            Coverage Diff            @@
##           master   #2512    +/-   ##
=======================================
+ Coverage      88%     90%     +1%
=======================================
  Files          69      69
  Lines        5629    5669     +40
=======================================
+ Hits         4964    5077    +113
+ Misses        665     592     -73
@@ -163,6 +165,10 @@ def train_fx(trial_hparams, cluster_manager, _):
else:
    XLA_AVAILABLE = True

pid = os.getpid()
rng1 = np.random.RandomState(pid)
RANDOM_PORTS = rng1.randint(10000, 19999, 100)
does this cause a failure for a distributed cluster > 100 nodes?
the random_port thing is only used for single-node ddp, not multi-node.
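For context, a free-standing sketch (not the Lightning source) of what the snippet above sets up: the launcher process seeds numpy with its own pid and draws a fixed pool of 100 candidate ports. The port_for_process helper is hypothetical, added only to show that indexing the pool with anything >= 100 would raise IndexError, i.e. the pool size is only safe because it is consulted from single-node DDP launches.

import os

import numpy as np

pid = os.getpid()
rng1 = np.random.RandomState(pid)                # deterministic for this launcher process
RANDOM_PORTS = rng1.randint(10000, 19999, 100)   # pool of 100 candidate ports

def port_for_process(process_idx: int) -> int:
    # Hypothetical lookup: raises IndexError for process_idx >= 100.
    return int(RANDOM_PORTS[process_idx])

print(port_for_process(0))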
torch.cuda.empty_cache()

if self.global_rank == 0 and q is not None:
    q.put(self.checkpoint_callback.best_model_path)
this feels hacky, what are we trying to do here? return the state of a callback to the main node? why put this specific attribute in the queue?
I think he did this so that one can call .test in the main process, which will access the best model to test on. But I agree with you, this seems fragile and dangerous to modify state across processes, there's gotta be a better way. Could one put the whole trainer in the queue in theory?
i tried that but it didn’t work. i think in a different PR we can do something like state_dict for the trainer.
i agree this isn't optimal but let's get a release that fixes all the test issues (which this PR does) and then we can figure out a longer term strategy
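A hedged sketch of the pattern being discussed, not the actual Trainer code: the spawned rank-0 worker puts one small, picklable value (the best checkpoint path) on a multiprocessing queue, and the parent reads it back after the join. This is also why pushing the whole Trainer through the queue fails: it carries CUDA handles and other non-picklable state. The worker function and the checkpoint path below are made up for illustration.

import torch.multiprocessing as mp


def worker(rank, q):
    # Placeholder for whatever the checkpoint callback would report.
    best_model_path = f"/tmp/checkpoints/rank{rank}-best.ckpt"
    if rank == 0 and q is not None:
        # Only rank 0 reports back; a plain string pickles cleanly.
        q.put(best_model_path)


if __name__ == "__main__":
    smp = mp.get_context("spawn")
    q = smp.SimpleQueue()
    mp.spawn(worker, args=(q,), nprocs=2)   # joins the workers before returning
    print("best model path from rank 0:", q.get())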
@@ -22,7 +22,7 @@ def wrapped_fn(*args, **kwargs):


def _warn(*args, **kwargs):
-    warnings.warn(*args, **kwargs)
+    warnings.warn(UserWarning(*args, **kwargs))
this is what causes the deprecation-warning tests to fail, @williamFalcon
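A small standalone sketch of why that change breaks such tests: warnings.warn takes the category as its second argument, so wrapping everything in UserWarning(...) silently turns the intended DeprecationWarning category into a plain constructor argument.

import warnings

# Passing the category explicitly: the recorded warning is a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("old API", DeprecationWarning)
    assert issubclass(caught[-1].category, DeprecationWarning)

# Wrapping the args in UserWarning(...) discards the category: the emitted
# warning is always a UserWarning, so e.g. pytest.deprecated_call() no longer
# sees a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn(UserWarning("old API", DeprecationWarning))
    assert issubclass(caught[-1].category, UserWarning)
    assert not issubclass(caught[-1].category, DeprecationWarning)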
@@ -377,17 +384,18 @@ def set_nvidia_flags(self, is_slurm_managing_tasks, data_parallel_device_ids):
        # don't make this debug... this is good UX
        rank_zero_info(f'CUDA_VISIBLE_DEVICES: [{os.environ["CUDA_VISIBLE_DEVICES"]}]')

-    def set_random_port(self):
+    def set_random_port(self, force=False):
@jeremyjordan this function is only ever called from ddp on a single node... not distributed
        rng1 = np.random.RandomState(pid)
        default_port = rng1.randint(10000, 19999, 1)[0]
        # pick a random port first
        assert self.num_nodes == 1, 'random port can only be called from single node training'
@jeremyjordan added this to make sure it's used as expected
If I'm understanding this correctly, it looks like this will disable multi-node support (at least, I'm not able to run across multiple nodes anymore due to this assertion - see issue here: flatironinstitute/deepblast#46)
@mortonjt can you open a github issue about this and explain how you launched your script?
Sure thing. See #2578
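To make the behaviour under discussion concrete, here is a free-standing sketch of what such a guarded helper does; the function signature and the force handling are my reading of the diff above, not a copy of the Lightning method. It derives one deterministic port from the pid and refuses to run for multi-node jobs, where MASTER_PORT should come from the cluster environment instead (the situation reported in #2578).

import os

import numpy as np


def set_random_port(num_nodes: int, force: bool = False) -> None:
    # Multi-node jobs get MASTER_PORT from SLURM / torchelastic / the user,
    # so picking a random one here would desynchronise the nodes.
    assert num_nodes == 1, 'random port can only be called from single node training'

    pid = os.getpid()
    rng1 = np.random.RandomState(pid)
    default_port = rng1.randint(10000, 19999, 1)[0]

    # Only override an existing port when explicitly forced (assumed semantics).
    if force or 'MASTER_PORT' not in os.environ:
        os.environ['MASTER_PORT'] = str(default_port)


set_random_port(num_nodes=1)
print(os.environ['MASTER_PORT'])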
@@ -446,15 +455,24 @@ def spawn_ddp_children(self, model):
            sleep(delay)

        local_rank = 0
-        self.ddp_train(local_rank, model, is_master=True)
+        results = self.ddp_train(local_rank, q=None, model=model, is_master=True)
I'd rather use a longer variable name than q
@tgaddair hey! any chance you can look at this?
Hey @williamFalcon, what was the issue? I see this PR has been merged. Is there still something that needs to be fixed on the Horovod side for this?
I don't think this is Horovod's fault. Something with the checkpoint paths changed. The tests are now simply skipped, which is not good :(
perfect! @awaelchli
    assert event_acc.summary_metadata['_hparams_/experiment'].plugin_data.plugin_name == 'hparams'
    assert event_acc.summary_metadata['_hparams_/experiment'].plugin_data.content == hparams_data
    #
add TODO
I added a reminder in the related issue #2371 that we should fix this
@@ -88,6 +88,7 @@ def test_horovod_cpu_implicit(tmpdir):
    _run_horovod(trainer_options)


@pytest.mark.skipif(True, reason="fix hv")
just use @pytest.mark.skip(reason="fix hv")
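For reference, a tiny example showing the two markers behave the same; skip just states the intent directly. Test names are placeholders.

import pytest


@pytest.mark.skipif(True, reason="fix hv")   # always skips, but reads as conditional
def test_horovod_old_style():
    assert False  # never executed


@pytest.mark.skip(reason="fix hv")           # same effect, clearer intent
def test_horovod_new_style():
    assert False  # never executed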