Fix memory leak #73

Open
wants to merge 1 commit into base: master
Conversation

@j3soon commented Jul 3, 2020

  • In EpisodeRunner, the actions should either be detached from the computation graph or be converted into a NumPy array before being stored in the replay buffer. In the original code, the entire computation graph used to generate the action is never released, which consumes an unnecessary amount of memory and may cause an OOM error if the program runs for a long time.
  • In Logger, the stats should be cleared periodically to avoid accumulating unnecessary logs in memory. A minimal sketch of both fixes is given below.
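
A minimal sketch of both fixes, using simplified stand-ins rather than the actual pymarl classes (the real EpisodeRunner and Logger have different interfaces; the class and attribute names below are illustrative only):

```python
from collections import defaultdict
import torch
import torch.nn as nn

class TinyEpisodeRunner:
    """Simplified stand-in for EpisodeRunner (not the pymarl class)."""
    def __init__(self, policy):
        self.policy = policy
        self.replay_buffer = []

    def step(self, obs):
        action = torch.tanh(self.policy(obs))       # output tensor is attached to a graph
        # Fix: store a graph-free copy, not the raw output tensor.
        self.replay_buffer.append(action.detach())  # or action.detach().cpu().numpy()

class TinyLogger:
    """Simplified stand-in for Logger (not the pymarl class)."""
    def __init__(self):
        self.stats = defaultdict(list)

    def log_stat(self, key, value, t):
        self.stats[key].append((t, value))

    def print_recent_stats(self):
        for key, values in self.stats.items():
            print(f"{key}: {values[-1]}")
        # Fix: drop entries that have already been reported so they do not
        # accumulate for the whole run.
        self.stats.clear()

runner = TinyEpisodeRunner(nn.Linear(8, 2))
runner.step(torch.randn(1, 8))
logger = TinyLogger()
logger.log_stat("return_mean", 1.0, t=0)
logger.print_recent_stats()
```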

@GoingMyWay

@j3soon Hi, this is an interesting issue. How many steps does it take to reproduce the OOM issue? Does it actually occur in practice?

@j3soon (Author) commented Jul 9, 2020

  • For the memory leak in EpisodeRunner:

    Running EpisodeRunner with QMIX for 8M steps and a hidden layer of 512 neurons should reproduce the issue.

    python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=27m_vs_30m runner=episode rnn_hidden_dim=512 test_interval=40000 log_interval=40000 runner_log_interval=40000 learner_log_interval=40000 t_max=8200000 save_model_interval=400000

    The CPU memory usage will grow linearly over time.

  • For the memory leak in Logger:

    python3 src/main.py --config=iql --env-config=sc2 with env_args.map_name=3m runner=episode test_interval=40000 log_interval=1 runner_log_interval=1 learner_log_interval=1 t_max=8200000 save_model_interval=400000

    Try logging more data (maybe 100 KB to 1 MB per timestep); the CPU memory usage will also grow linearly over time.

    This issue is not as severe as the EpisodeRunner one, since the logs are small by default. It only arises when the Logger is modified to log more data.

@GoingMyWay

Great finding. Do you think the issue comes from the actions not being detached from the graph?

@j3soon (Author) commented Jul 9, 2020

Yes. PyTorch maintains a computation graph during the forward pass to record the tensor operations. When the loss is defined and we call tensor.backward(), the computation graph is traversed backwards for backpropagation and released along the way.

During training, the computation graph is released because we do call tensor.backward(). But while collecting episode experiences (interacting with the environment), the action is computed through forward passes and the action tensor is stored directly in the replay buffer. Since tensor.backward() is never called there, the computation graph is never released, making the memory consumption of the action tensor unreasonably large. Thus we should call either action = action.detach() or action = action.cpu().numpy(); both release the reference to the computation graph.
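
A standalone illustration of the mechanism described above (not the actual pymarl code); the toy tanh policy just stands in for any forward pass whose output gets stored:

```python
import torch
import torch.nn as nn

policy = nn.Linear(16, 4)
obs = torch.randn(1, 16)

# A tensor produced by a forward pass carries a grad_fn, i.e. a reference
# to the computation graph that created it.
action = torch.tanh(policy(obs))
print(action.grad_fn)              # e.g. <TanhBackward0 object ...>: graph attached
print(action.detach().grad_fn)     # None: the graph reference is dropped

leaky, fixed = [], []
for _ in range(3):
    a = torch.tanh(policy(torch.randn(1, 16)))
    # Storing `a` directly keeps the whole graph of this forward pass alive
    # for as long as the buffer holds it.
    leaky.append(a)
    # Storing a detached copy (or a NumPy array) keeps only the values,
    # so the graph can be garbage-collected right away.
    fixed.append(a.detach())       # or a.detach().cpu().numpy()
```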

@GoingMyWay

@j3soon Fantastic finding. I will try it. Very good.
