Out of memory errors #102

Closed
schwab opened this issue Jan 1, 2021 · 3 comments


schwab commented Jan 1, 2021

How much memory is needed to train the MuZero games? So far I have run out of memory with atari and breakout on a system with 64 GB of RAM. Most of the memory is used by the ReplayBuffer, so perhaps there is a way to limit that? Also, it doesn't seem to use swap effectively: it always dies when it hits 95% of system memory. (This is a Linux system running Ubuntu 20.10.)

  File "/home/mcstar_dev/.local/lib/python3.8/site-packages/ray/worker.py", line 1379, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::SharedStorage.get_info() (pid=17985, ip=192.168.1.175)
  File "python/ray/_raylet.pyx", line 423, in ray._raylet.execute_task
  File "/home/mcstar_dev/.local/lib/python3.8/site-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
    raise RayOutOfMemoryError(
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node tr001 is used (59.62 / 62.75 GB). The top 10 memory consumers are:

PID     MEM      COMMAND
17950   50.6GiB  ray::ReplayBuffer
17944   1.48GiB  ray::Trainer.continuous_update_weights()
17978   0.8GiB   ray::SelfPlay.continuous_self_play()
9431    0.29GiB  /usr/bin/gnome-shell
981691  0.24GiB  /usr/bin/python3 /home/mcstar_dev/.local/bin/tensorboard --logdir ./results
17980   0.23GiB  ray::SelfPlay.continuous_self_play()
110347  0.19GiB  /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=6870262073328718638,16916755515565
17599   0.17GiB  python3 muzero.py
777299  0.16GiB  /snap/code/52/usr/share/code/code --type=renderer --disable-color-correct-rendering --no-sandbox --f
777313  0.16GiB  /snap/code/52/usr/share/code/code --type=renderer --disable-color-correct-rendering --no-sandbox --f

In addition, up to 1.93 GiB of shared memory is currently being used by the Ray object store.
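
As a rough way to reason about the "how much memory" question, the sketch below estimates the RAM taken by stored observations alone. Everything in it is an assumption for illustration: the function name, the observation shape, the game and step counts, and the 4-byte (float32) value size are hypothetical and were not measured from this project. A real buffer typically also stores actions, rewards, and search statistics, so actual usage would likely be higher.

```python
import math


def estimate_observation_gib(num_games, steps_per_game, observation_shape, bytes_per_value=4):
    """Rough lower bound (in GiB) on RAM used by stored observations only.

    All arguments are hypothetical inputs chosen by the caller; nothing here
    is read from the project's configuration.
    """
    values_per_observation = math.prod(observation_shape)
    total_bytes = num_games * steps_per_game * values_per_observation * bytes_per_value
    return total_bytes / 2**30


# Hypothetical example: 100 games of 1000 steps, each observation a
# (3, 96, 96) float32 array.
print(f"{estimate_observation_gib(100, 1000, (3, 96, 96)):.1f} GiB")
```

Because the estimate scales linearly with the number of stored games, halving the number of games kept halves this part of the footprint.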

werner-duvaud (Owner) commented Jan 1, 2021

Hi,
Most simple games don't need a lot of RAM. The Atari version, which uses the same network as in the MuZero paper, can require more. I haven't tested Atari, so I have no idea how much RAM is needed. That said, the replay buffer does store the games in RAM, so training Atari for a long time would require a system that writes games to disk and loads into RAM the games with the greatest probability of being used. I will try to add that when I have time; otherwise, a PR is welcome.
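
One possible shape for the disk-backed idea described above is sketched here. It is not the project's implementation: the class `DiskBackedBuffer`, its methods, the per-game priority value, the pickle file layout, and the `max_in_ram` parameter are all hypothetical names introduced for illustration. The sketch keeps the highest-priority games in memory and spills the lowest-priority ones to disk.

```python
import os
import pickle
import tempfile


class DiskBackedBuffer:
    """Hypothetical sketch: keep at most `max_in_ram` games in memory,
    spilling the lowest-priority games to pickle files on disk."""

    def __init__(self, max_in_ram, directory):
        self.max_in_ram = max_in_ram
        self.directory = directory
        self.in_ram = {}      # game_id -> game object held in memory
        self.priorities = {}  # game_id -> sampling priority (all games)
        os.makedirs(directory, exist_ok=True)

    def _path(self, game_id):
        return os.path.join(self.directory, f"game_{game_id}.pkl")

    def save_game(self, game_id, game, priority):
        self.priorities[game_id] = priority
        self.in_ram[game_id] = game
        self._spill_lowest_priority()

    def _spill_lowest_priority(self):
        # Write the least likely-to-be-used games to disk until the
        # in-memory set fits within the budget.
        while len(self.in_ram) > self.max_in_ram:
            victim = min(self.in_ram, key=self.priorities.__getitem__)
            with open(self._path(victim), "wb") as f:
                pickle.dump(self.in_ram.pop(victim), f)

    def get_game(self, game_id):
        # Serve from memory when possible, otherwise read from disk.
        if game_id in self.in_ram:
            return self.in_ram[game_id]
        with open(self._path(game_id), "rb") as f:
            return pickle.load(f)


# Usage with placeholder "games" (plain lists stand in for game histories).
buffer = DiskBackedBuffer(max_in_ram=2, directory=tempfile.mkdtemp())
buffer.save_game(0, [1, 2, 3], priority=0.5)
buffer.save_game(1, [4], priority=0.9)
buffer.save_game(2, [5, 6], priority=0.1)  # lowest priority, goes to disk
print(sorted(buffer.in_ram))
```

Reading a spilled game costs a disk access on every request in this sketch; a fuller design would probably promote frequently sampled games back into memory.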

ahainaut (Collaborator) commented Jan 7, 2021

Hi @schwab,
In addition to Werner's answer, you can lower the replay_buffer size to save some RAM.
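
To illustrate why a smaller buffer size bounds memory, here is a minimal sketch of a size-capped buffer that drops the oldest game once the cap is exceeded. The class `BoundedReplayBuffer` and its attribute names are hypothetical and only assume that the size setting counts stored games; the project's own buffer may behave differently.

```python
from collections import OrderedDict


class BoundedReplayBuffer:
    """Hypothetical sketch: hold at most `replay_buffer_size` games."""

    def __init__(self, replay_buffer_size):
        self.replay_buffer_size = replay_buffer_size
        self.games = OrderedDict()  # game index -> game, oldest first
        self.num_played_games = 0

    def save_game(self, game):
        self.games[self.num_played_games] = game
        self.num_played_games += 1
        # Drop the oldest games so memory use stays bounded by the cap.
        while len(self.games) > self.replay_buffer_size:
            self.games.popitem(last=False)


# Usage: with a cap of 3, only the 3 most recent of 5 games are kept.
buffer = BoundedReplayBuffer(replay_buffer_size=3)
for step in range(5):
    buffer.save_game([step])
print(list(buffer.games))
```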

schwab closed this as completed Jan 11, 2021

schwab (Author) commented Jan 11, 2021

OK, I will use a smaller buffer for now.
