Debugging multi-GPU issue #161

vwxyzjn · 2022-05-23T14:49:57Z

In IsaacGymEnvs, rl-games + multiGPU seems to have some issues. As shown in the screenshot, rl-games + multiGPU uses twice the amount of data and performs worse than the single GPU setting in Ant.

This issue tracks the investigation.

Proposed debugging route

I suggest first making sure there is no loss in sample efficiency before scaling to more envs, by matching the implementation details in our CleanRL prototype: https://cleanrl-git-new-multi-gpu-vwxyzjn.vercel.app/rl-algorithms/ppo/#implementation-details_6.

Identified issues:

1. Seeding logic and configuration issue

Fix multi-gpu seeds #162

We need to seed the multiGPU processes with different seeds to decorrelate experience; otherwise the multiGPU processes will produce exactly the same observations.

Configuration-wise, we can set the overall seed with params.seed and the env seed with params.config.env_config.seed. If params.config.env_config.seed is set but params.seed is not, we get identical observations from the environments, as shown below:

This is probably ok since the agent still samples different actions, but it's nonetheless a problem. The correct implementation is to use seed = seed + local_rank.
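A minimal sketch of the seed = seed + local_rank idea, using only the Python standard library. The nested dict mirrors the params.seed and params.config.env_config.seed keys mentioned above, but `offset_seeds` and `fake_reset` are hypothetical helpers written for illustration, not rl-games functions, and applying the offset to the env seed as well is an assumption rather than something confirmed by #162.

```python
import random


def offset_seeds(params: dict, local_rank: int) -> dict:
    """Add local_rank to every seed that is set (hypothetical helper, not rl-games API)."""
    seed = params.get("seed")
    env_seed = params["config"]["env_config"].get("seed")
    return {
        "seed": None if seed is None else seed + local_rank,
        # Assumption: the env seed is offset the same way as the overall seed.
        "env_seed": None if env_seed is None else env_seed + local_rank,
    }


def fake_reset(seed: int, n: int = 4) -> list:
    """Stand-in for an environment reset: draws n pseudo-random 'observations'."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


params = {"seed": 42, "config": {"env_config": {"seed": 7}}}

# Without the offset, every rank resets with the same env seed: identical observations.
correlated = [fake_reset(params["config"]["env_config"]["seed"]) for _ in range(2)]

# With the offset, each rank gets its own seed and therefore its own stream.
decorrelated = [fake_reset(offset_seeds(params, rank)["env_seed"]) for rank in range(2)]
```

In a real launch, local_rank would come from the distributed launcher (for example Horovod's local rank) rather than a loop index.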

2. Stepping logic issue

Refactor optimizer step logic #163

After fixing #163, I was able to match the sample efficiency in the single GPU setting:

However, the wall-clock time is slower than I had expected. In a separate benchmark I ran with CleanRL, the experiments show Horovod should make Ant step 20% faster.

Maybe it's the overhead of averaging stats? In the CleanRL benchmark experiments I did not touch stats at all.


Denys88 · 2022-06-19T05:22:22Z

@vwxyzjn can we close it?

vwxyzjn · 2022-06-19T12:22:23Z

Closed by #171

1tac11 · 2023-04-18T19:25:38Z

Hi there,
Is multi-instance multi-GPU working?

vwxyzjn closed this as completed Jun 19, 2022
