
Multi-GPU usage #95

Closed
fil-mp opened this issue Oct 21, 2021 · 12 comments

Comments

@fil-mp

fil-mp commented Oct 21, 2021

How can I use multiple GPUs for simulation and training? I am enabling horovod but it seems that it can only use one device.

@ViktorM
Collaborator

ViktorM commented Oct 22, 2021

How did you try to run training on multiple GPUs? Can you share the script you ran?

For example, with Isaac Gym you can run it as:

```
horovodrun -np 8 python rlg_train.py --task <task_name> --horovod --headless
```

where `-np 8` sets the number of worker processes (one per GPU).

@fil-mp
Author

fil-mp commented Oct 25, 2021

Thanks for the response.

I have been using the same command but I get this error:
[Screenshot attached: "Screenshot from 2021-10-25 16-16-23" showing the error output]

@fil-mp
Author

fil-mp commented Oct 26, 2021

OK, I think that's the expected behavior when MAX_EPOCHS has been reached and the root process terminates.

@Denys88
Owner

Denys88 commented Oct 26, 2021

Thanks for the update. We will take a look at how to remove this message on terminate.

@mohamedhassanmus

I am getting the same error. How can I bypass this?
Thanks!

@fil-mp
Author

fil-mp commented Oct 27, 2021

You can ignore this for now, since it doesn't affect the training. It is just a message on terminate, which they will probably remove.

@mohamedhassanmus

In my case, the code stops after this message and the training doesn't continue.

@Denys88
Owner

Denys88 commented Oct 30, 2021

Could you show the whole error call stack?

@Denys88
Owner

Denys88 commented Nov 2, 2021

I've found what causes this issue: only the rank 0 process checked the number of epochs. I'll make a fix in a few days.
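To illustrate the bug and the shape of the fix: if only rank 0 checks `max_epochs` and exits, the remaining ranks keep waiting on collectives and the job hangs. The sketch below is not rl_games' actual code; it simulates a Horovod-style `allreduce` in-process so every rank agrees on the stop decision.

```python
def allreduce_max(local_flags):
    """Simulated collective: every rank sees the max of all local flags.
    In a real Horovod setup this would be an allreduce with a max op."""
    global_flag = max(local_flags)
    return [global_flag] * len(local_flags)

def stop_flags(epochs_per_rank, max_epochs):
    # Fixed behavior: EVERY rank computes a local stop flag, then the
    # flags are combined across ranks so all processes exit together.
    local = [1 if e >= max_epochs else 0 for e in epochs_per_rank]
    return allreduce_max(local)

# Rank 0 hit max_epochs while ranks 1-3 are one step behind; after the
# allreduce every rank agrees to terminate, avoiding the hang.
print(stop_flags([10, 9, 9, 9], max_epochs=10))  # -> [1, 1, 1, 1]
```

The buggy version corresponds to skipping the allreduce: rank 0's flag is 1 but the others stay 0, so they never leave the training loop.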

@Denys88
Owner

Denys88 commented Nov 3, 2021

I've found 3 small issues:

  1. exit on max epochs
  2. exit on max rewards
  3. uninitialized variable (at least in discrete envs)

I don't have access to the multi-GPU machine right now and simulated it with multi-CPU. But it should work :)
https://pypi.org/project/rl-games/1.1.4/
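For reference, the release linked above can be pulled in with a standard pip version pin (the version number matches the PyPI link):

```shell
pip install rl-games==1.1.4
```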

@1tac11

1tac11 commented Apr 18, 2023

Is this tested across multiple instances, or only with multiple GPUs on a single instance? I had issues with multiple instances.

@ViktorM
Collaborator

ViktorM commented Apr 18, 2023

It was tested on a single node, up to 8 GPUs with Isaac Gym. Each instance of Isaac Gym was running on its own GPU.
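For anyone attempting a multi-node run: Horovod itself supports spreading processes across hosts with its documented `-H` host-list syntax. This is untested with rl_games in this thread; the hostnames and slot counts below are illustrative only.

```shell
# 16 processes total: 8 slots on each of two hosts (one slot per GPU).
horovodrun -np 16 -H server1:8,server2:8 \
    python rlg_train.py --task <task_name> --horovod --headless
```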
