
Prototype torch.distributed integration #158

Closed

Conversation

@vwxyzjn (Contributor) commented May 20, 2022

This PR prototypes torch.distributed integration for multi-GPU training.
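For context, here is a minimal, generic sketch of what a torch.distributed setup typically looks like (illustrative only, assuming a torchrun launch; this is not the code from this PR):

```python
# Generic torch.distributed setup sketch; typically launched with:
#   torchrun --nproc_per_node=NUM_GPUS train.py
import os

import torch
import torch.distributed as dist

def init_distributed() -> int:
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# Gradients can then be averaged across GPUs by wrapping the model:
#   model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```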

@@ -966,7 +973,7 @@ def train(self):
             update_time = 0
             if self.multi_gpu:
                 should_exit_t = torch.tensor(should_exit).float()
-                self.hvd.broadcast_value(should_exit_t, 'should_exit')
+                # self.hvd.broadcast_value(should_exit_t, 'should_exit')  # what is the purpose of this?
Owner
There is a chance that one job will exit a little bit earlier; as a result, the other jobs will crash.

Contributor Author
This method seems to have been deprecated in Horovod: the closest method I can find is https://horovod.readthedocs.io/en/stable/api.html#horovod.torch.broadcast_.

What exactly is this broadcast_value doing? If rank 0's should_exit_t=False and rank 1's should_exit_t=True, would broadcast_value overwrite rank 1's should_exit_t?

Owner
Yes.
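For reference, here is a minimal sketch of how the same exit-flag synchronization could look with torch.distributed (the helper name is hypothetical and assumes the process group is already initialized; this is not the code from this PR):

```python
import torch
import torch.distributed as dist

def sync_should_exit(should_exit: bool, device: torch.device) -> bool:
    """Broadcast rank 0's exit decision so that all ranks stop together.

    Hypothetical helper; assumes dist.init_process_group(...) was already called.
    """
    should_exit_t = torch.tensor([float(should_exit)], device=device)
    # Every rank receives rank 0's value, overwriting its own local flag
    # (the same overwrite behavior as hvd.broadcast_value discussed above).
    dist.broadcast(should_exit_t, src=0)
    return bool(should_exit_t.item())
```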

@@ -0,0 +1,84 @@
+params:
Owner
BTW, could you use Breakout from envpool?

Contributor Author
Envpool is actually trickier. When prototyping with envpool, multi-GPU is actually slower, at least out of the box. This is because envpool uses separate threads and has complex interactions with those threads that are a bit difficult to control. For this reason, I have chosen the regular Gym API so the performance comparison stays controlled.

@vwxyzjn (Contributor Author) commented May 24, 2022

cc @markelsanz14, I prototyped the torch.distributed integration, but it's only 6% faster. I still feel I am missing a bottleneck somewhere, because the prototype with CleanRL was about 25% faster.
