Fix ddp tests + .test() #2512
Conversation
Hello @williamFalcon! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-07-07 16:09:33 UTC
Codecov Report

@@            Coverage Diff            @@
##           master   #2512    +/-   ##
=======================================
+ Coverage      88%     90%     +1%
=======================================
  Files          69      69
  Lines        5629    5669     +40
=======================================
+ Hits         4964    5077    +113
+ Misses        665     592     -73
@@ -163,6 +165,10 @@ def train_fx(trial_hparams, cluster_manager, _):
else:
    XLA_AVAILABLE = True

pid = os.getpid()
rng1 = np.random.RandomState(pid)
RANDOM_PORTS = rng1.randint(10000, 19999, 100)
does this cause a failure for a distributed cluster > 100 nodes?
the random_port thing is only used for single-node ddp, not multi-node.
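For context, a free-standing sketch (not the Lightning source) of what the snippet above sets up: the launcher process seeds numpy with its own pid and draws a fixed pool of 100 candidate ports. The port_for_process helper is hypothetical, added only to show that indexing the pool with anything >= 100 would raise IndexError, i.e. the pool size is only safe because it is consulted from single-node DDP launches.

import os

import numpy as np

pid = os.getpid()
rng1 = np.random.RandomState(pid)                # deterministic for this launcher process
RANDOM_PORTS = rng1.randint(10000, 19999, 100)   # pool of 100 candidate ports

def port_for_process(process_idx: int) -> int:
    # Hypothetical lookup: raises IndexError for process_idx >= 100.
    return int(RANDOM_PORTS[process_idx])

print(port_for_process(0))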
torch.cuda.empty_cache()

if self.global_rank == 0 and q is not None:
    q.put(self.checkpoint_callback.best_model_path)
this feels hacky, what are we trying to do here? return the state of a callback to the main node? why put this specific attribute in the queue?
I think he did this so that one can call .test in the main process, which will access the best model to test on. But I agree with you, this seems fragile and dangerous to modify state across processes, there's gotta be a better way. Could one put the whole trainer in the queue in theory?
i tried that but it didn’t work. i think in a different PR we can do something like state_dict for the trainer.
i agree this isn't optimal but let's get a release that fixes all the test issues (which this PR does) and then we can figure out a longer term strategy
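A hedged sketch of the pattern being discussed, not the actual Trainer code: the spawned rank-0 worker puts one small, picklable value (the best checkpoint path) on a multiprocessing queue, and the parent reads it back after the join. This is also why pushing the whole Trainer through the queue fails: it carries CUDA handles and other non-picklable state. The worker function and the checkpoint path below are made up for illustration.

import torch.multiprocessing as mp


def worker(rank, q):
    # Placeholder for whatever the checkpoint callback would report.
    best_model_path = f"/tmp/checkpoints/rank{rank}-best.ckpt"
    if rank == 0 and q is not None:
        # Only rank 0 reports back; a plain string pickles cleanly.
        q.put(best_model_path)


if __name__ == "__main__":
    smp = mp.get_context("spawn")
    q = smp.SimpleQueue()
    mp.spawn(worker, args=(q,), nprocs=2)   # joins the workers before returning
    print("best model path from rank 0:", q.get())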
@@ -22,7 +22,7 @@ def wrapped_fn(*args, **kwargs):


def _warn(*args, **kwargs):
-    warnings.warn(*args, **kwargs)
+    warnings.warn(UserWarning(*args, **kwargs))
this is what causes the deprecation-warning tests to fail, @williamFalcon
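A small standalone sketch of why that change breaks such tests: warnings.warn takes the category as its second argument, so wrapping everything in UserWarning(...) silently turns the intended DeprecationWarning category into a plain constructor argument.

import warnings

# Passing the category explicitly: the recorded warning is a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("old API", DeprecationWarning)
    assert issubclass(caught[-1].category, DeprecationWarning)

# Wrapping the args in UserWarning(...) discards the category: the emitted
# warning is always a UserWarning, so e.g. pytest.deprecated_call() no longer
# sees a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn(UserWarning("old API", DeprecationWarning))
    assert issubclass(caught[-1].category, UserWarning)
    assert not issubclass(caught[-1].category, DeprecationWarning)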
@@ -377,17 +384,18 @@ def set_nvidia_flags(self, is_slurm_managing_tasks, data_parallel_device_ids):
        # don't make this debug... this is good UX
        rank_zero_info(f'CUDA_VISIBLE_DEVICES: [{os.environ["CUDA_VISIBLE_DEVICES"]}]')

-    def set_random_port(self):
+    def set_random_port(self, force=False):
@jeremyjordan this function is only ever called from ddp on a single node... not distributed
        rng1 = np.random.RandomState(pid)
        default_port = rng1.randint(10000, 19999, 1)[0]
        # pick a random port first
        assert self.num_nodes == 1, 'random port can only be called from single node training'
@jeremyjordan added this to make sure it's used as expected
If I'm understanding this correctly, it looks like this will disable multi-node support (at least, I'm not able to run across multiple nodes anymore due to this assertion - see issue here: flatironinstitute/deepblast#46)
@mortonjt can you open a github issue about this and explain how you launched your script?
Sure thing. See #2578
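To make the behaviour under discussion concrete, here is a free-standing sketch of what such a guarded helper does; the function signature and the force handling are my reading of the diff above, not a copy of the Lightning method. It derives one deterministic port from the pid and refuses to run for multi-node jobs, where MASTER_PORT should come from the cluster environment instead (the situation reported in #2578).

import os

import numpy as np


def set_random_port(num_nodes: int, force: bool = False) -> None:
    # Multi-node jobs get MASTER_PORT from SLURM / torchelastic / the user,
    # so picking a random one here would desynchronise the nodes.
    assert num_nodes == 1, 'random port can only be called from single node training'

    pid = os.getpid()
    rng1 = np.random.RandomState(pid)
    default_port = rng1.randint(10000, 19999, 1)[0]

    # Only override an existing port when explicitly forced (assumed semantics).
    if force or 'MASTER_PORT' not in os.environ:
        os.environ['MASTER_PORT'] = str(default_port)


set_random_port(num_nodes=1)
print(os.environ['MASTER_PORT'])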
@@ -446,15 +455,24 @@ def spawn_ddp_children(self, model):
            sleep(delay)

        local_rank = 0
-        self.ddp_train(local_rank, model, is_master=True)
+        results = self.ddp_train(local_rank, q=None, model=model, is_master=True)
I'd rather use a longer variable name than q
@tgaddair hey! any chance you can look at this?
Hey @williamFalcon, what was the issue? I see this PR has been merged. Is there still something that needs to be fixed on the Horovod side for this?
I don't think this is Horovod's fault. Something with the checkpoint paths changed. The tests are now simply skipped, which is not good :(
perfect! @awaelchli
    assert event_acc.summary_metadata['_hparams_/experiment'].plugin_data.plugin_name == 'hparams'
    assert event_acc.summary_metadata['_hparams_/experiment'].plugin_data.content == hparams_data
    #
add TODO
I added a reminder in the related issue #2371 that we should fix this
@@ -88,6 +88,7 @@ def test_horovod_cpu_implicit(tmpdir):
    _run_horovod(trainer_options)


@pytest.mark.skipif(True, reason="fix hv")
just use @pytest.mark.skip(reason="fix hv")
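For reference, a tiny example showing the two markers behave the same; skip just states the intent directly. Test names are placeholders.

import pytest


@pytest.mark.skipif(True, reason="fix hv")   # always skips, but reads as conditional
def test_horovod_old_style():
    assert False  # never executed


@pytest.mark.skip(reason="fix hv")           # same effect, clearer intent
def test_horovod_new_style():
    assert False  # never executed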