
Multi-GPU usage #95

Closed
fil-mp opened this issue Oct 21, 2021 · 12 comments

Comments

@fil-mp

fil-mp commented Oct 21, 2021

How can I use multiple GPUs for simulation and training? I am enabling horovod but it seems that it can only use one device.

@ViktorM
Collaborator

ViktorM commented Oct 22, 2021

How did you try to run training on multiple GPUs? Can you share the script you ran?

For example, with Isaac Gym you can run it as:

```
horovodrun -np 8 python rlg_train.py --task <task_name> --horovod --headless
```

where `-np 8` sets the number of worker processes (one per GPU).

@fil-mp
Author

fil-mp commented Oct 25, 2021

Thanks for the response.

I have been using the same command but I get this error:
[Screenshot attached: "Screenshot from 2021-10-25 16-16-23" showing the error output]

@fil-mp
Author

fil-mp commented Oct 26, 2021

OK, I think that's the expected behavior when MAX_EPOCHS has been reached and the root process terminates.

@Denys88
Owner

Denys88 commented Oct 26, 2021

Thanks for the update. We will take a look at how to remove this message on terminate.

@mohamedhassanmus

I am getting the same error. How can I bypass this?
Thanks!

@fil-mp
Author

fil-mp commented Oct 27, 2021

You can ignore this for now, since it doesn't affect the training. It is just a message on terminate, which they will probably remove.

@mohamedhassanmus

In my case, the code stops after this message and the training doesn't continue.

@Denys88
Owner

Denys88 commented Oct 30, 2021

Could you show the whole error call stack?

@Denys88
Owner

Denys88 commented Nov 2, 2021

I've found what causes this issue: only the rank 0 process checked the number of epochs. I'll make a fix in a few days.
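To illustrate the bug and the shape of the fix: if only rank 0 checks `max_epochs` and exits, the remaining ranks keep waiting on collectives and the job hangs. The sketch below is not rl_games' actual code; it simulates a Horovod-style `allreduce` in-process so every rank agrees on the stop decision.

```python
def allreduce_max(local_flags):
    """Simulated collective: every rank sees the max of all local flags.
    In a real Horovod setup this would be an allreduce with a max op."""
    global_flag = max(local_flags)
    return [global_flag] * len(local_flags)

def stop_flags(epochs_per_rank, max_epochs):
    # Fixed behavior: EVERY rank computes a local stop flag, then the
    # flags are combined across ranks so all processes exit together.
    local = [1 if e >= max_epochs else 0 for e in epochs_per_rank]
    return allreduce_max(local)

# Rank 0 hit max_epochs while ranks 1-3 are one step behind; after the
# allreduce every rank agrees to terminate, avoiding the hang.
print(stop_flags([10, 9, 9, 9], max_epochs=10))  # -> [1, 1, 1, 1]
```

The buggy version corresponds to skipping the allreduce: rank 0's flag is 1 but the others stay 0, so they never leave the training loop.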

@Denys88
Owner

Denys88 commented Nov 3, 2021

I've found 3 small issues:

  1. exit on max epochs
  2. exit on max rewards
  3. uninitialized variable (at least in discrete envs)

I don't have access to the multi-GPU machine right now and simulated it with multi-CPU. But it should work :)
https://pypi.org/project/rl-games/1.1.4/
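For reference, the release linked above can be pulled in with a standard pip version pin (the version number matches the PyPI link):

```shell
pip install rl-games==1.1.4
```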

@1tac11

1tac11 commented Apr 18, 2023

Is this tested across multiple instances, or only with multiple GPUs on a single instance? I had issues with multiple instances.

@ViktorM
Collaborator

ViktorM commented Apr 18, 2023

It was tested on a single node, up to 8 GPUs with Isaac Gym. Each instance of Isaac Gym was running on its own GPU.
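For anyone attempting a multi-node run: Horovod itself supports spreading processes across hosts with its documented `-H` host-list syntax. This is untested with rl_games in this thread; the hostnames and slot counts below are illustrative only.

```shell
# 16 processes total: 8 slots on each of two hosts (one slot per GPU).
horovodrun -np 16 -H server1:8,server2:8 \
    python rlg_train.py --task <task_name> --horovod --headless
```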
