Ensure the consistency of central_value_net with the same initial parameters before training starts in a multi-GPU setting. #297

annan-tang · 2024-07-09T11:58:28Z

Solution for Potential Issues with Multi-GPU/Node Training with Central Network Weights Initialization #296


ViktorM · 2024-07-12T08:41:52Z

Thank you for the PR, I'll take a look tomorrow. Could you please update it to the latest master?

annan-tang · 2024-07-12T08:46:46Z

> Hi @annan-tang,
>
> Thank you for PR, I'll take a look tomorrow. Could you please update it to the latest master?

Thank you very much, I will update it later. I'm also running experiments to show the effect and will report more results within a few days.

annan-tang · 2024-07-19T07:36:55Z

Hi,

I compared training with and without the code that aligns the initial parameters of the central value network, in a 2-GPU setting. I used the default Trifinger example in IsaacGymEnvs with the following command:

torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py multi_gpu=True task=Trifinger headless=True seed={xxx}

For each case, I tested five groups of random seeds ({xxx}) and found little difference between runs with and without the initial parameter alignment. The reward curves are illustrated below:

Based on these results, the initial parameter alignment appears to have little effect in the 2-GPU setting. However, I'm not sure whether this would change when scaling up to dozens of GPUs.

Denys88 · 2024-09-11T19:17:19Z

merging it.

Add a broadcast for the initial parameters of central_value_net in multi-GPU/node training.

42681aa
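
For readers unfamiliar with the idea, here is a minimal sketch of what broadcasting a network's initial parameters can look like. It is not the code from this PR. It assumes the process group has already been initialized (as is typical for a `torchrun` launch) and that the model already sits on the device the backend expects. The helper name `broadcast_initial_parameters` is hypothetical.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def broadcast_initial_parameters(model: nn.Module, src: int = 0) -> None:
    """Copy rank `src`'s parameters and buffers into `model` on every rank.

    Hypothetical helper for illustration only. It does nothing when
    torch.distributed is unavailable or not initialized (single-process runs).
    """
    if not (dist.is_available() and dist.is_initialized()):
        return
    with torch.no_grad():
        # state_dict() tensors share storage with the live parameters and
        # buffers, so the in-place broadcast updates the model directly.
        for tensor in model.state_dict().values():
            dist.broadcast(tensor, src=src)
```

Calling a helper like this once, right after the network is constructed and before the first optimizer step, would give every rank the same starting weights regardless of how each process seeded its initializer.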

annan-tang changed the title ~~make sure the consistence of central_value_net with same initial params before training in multi-gpu setting~~ make sure the consistence of central_value_net with same initial params before training start in the context of multi-gpu setting Jul 9, 2024

annan-tang changed the title ~~make sure the consistence of central_value_net with same initial params before training start in the context of multi-gpu setting~~ Ensure the consistency of central_value_net with the same initial parameters before training starts in a multi-GPU setting. Jul 9, 2024

Denys88 merged commit 59d4c40 into Denys88:master Sep 11, 2024
