True rewards remain "zero" in the trajectories in stable-baselines2 for custom environments #1167
Hey. Unfortunately we do not have time to offer custom tech support for custom environments. The library code is tested to function (mostly) correctly, so my knee-jerk reply is that something may be off in your environment. I would recommend two things: double-check that your environment follows the Gym API, and consider moving to stable-baselines3.
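A minimal sketch of such a sanity check with the env checker that ships with stable-baselines; the `CustomEnv` class below is a hypothetical stand-in so the snippet runs on its own, not the user's actual environment:

```python
# Sanity-check a custom environment's Gym API compliance with SB2's env checker.
# `CustomEnv` is a hypothetical placeholder for the user's own environment class.
import gym
import numpy as np
from stable_baselines.common.env_checker import check_env


class CustomEnv(gym.Env):
    """Hypothetical placeholder for the custom optimization environment."""

    def __init__(self):
        super().__init__()
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-10.0, high=10.0, shape=(4,), dtype=np.float32)

    def reset(self):
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        obs = np.zeros(4, dtype=np.float32)
        reward = float(np.sum(np.square(action)))  # non-zero reward signal
        done = True
        return obs, reward, done, {}


# Raises errors / prints warnings if the spaces, reset() or step() violate the Gym API.
check_env(CustomEnv(), warn=True)
```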
I have checked my environment. By the way, I forgot to show the TensorBoard plot: the episode reward curve is a flat horizontal line. I think you are right, @Miffyli, and I am starting to consider migrating to stable_baselines3 (at least my next research project will not be in stable-baselines2). But my code base is quite large (spectral normalization, dense connections, a custom AMSGrad optimizer implementation, and a custom Q-value network method for Soft Actor-Critic to implement the Wolpertinger algorithm), which is the major cause of my hesitation.
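For reference, a minimal sketch of what the PPO training setup could look like after a migration to stable-baselines3; CartPole-v1 and the log path are placeholders, and the custom components mentioned above would still need to be ported to PyTorch separately (for example through SB3's custom policy classes):

```python
# Minimal SB3 training sketch; CartPole-v1 stands in for the custom environment.
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # replace with the custom optimization env
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./tb_logs/")
model.learn(total_timesteps=10_000)
```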
Unfortunately I do not have other tips to give and no time to start digging through custom code to find errors :( . I know this is a very bad, maybe even rude-ish, answer that assumes it is a user error, but there are many places where an env implementation can go wrong and cause confusing behaviour like this. If possible, I would recommend taking an environment where the rewards work as expected and changing it towards your final env step by step.
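One way to follow that step-by-step advice, sketched here under the assumption that CartPole-v1 is the known-good starting point; the wrapper name and the reward substitution are illustrative only:

```python
# Start from an environment whose rewards are known to work (CartPole-v1) and
# move towards the custom environment one change at a time, re-checking the
# reward curve after each change.
import gym


class PartialCustomRewardWrapper(gym.Wrapper):
    """Hypothetical wrapper: keeps CartPole's dynamics, swaps the reward gradually."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # First step: keep the original reward; later steps replace it piece by
        # piece with the custom optimization objective.
        custom_reward = reward
        return obs, custom_reward, done, info


env = PartialCustomRewardWrapper(gym.make("CartPole-v1"))
```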
No problem. I am trying to resolve it, and I will report the reasons as soon as I find them.
I think in SB3 other things become a bottleneck before PyTorch's eager mode becomes the slowing factor: handling the data, computing returns, etc. takes much more time than actually running the network. I personally do not know about performance beyond RL, but AFAIK it is not worth the effort to switch to TF2 just to get a bit of a speed boost.
I think that if the number of CPUs (for parallel rollouts) is much larger than the number of GPU SMs, then data will always be available for training and the GPUs will always be busy, so eager mode may indeed become the bottleneck (though I agree it may not be too severe). Thanks a lot!
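A sketch of the parallel-rollout setup being discussed, assuming stable-baselines3's SubprocVecEnv; the worker count and CartPole-v1 are placeholder choices:

```python
# Several CPU workers collect experience in parallel so the GPU stays busy
# during updates. Replace CartPole-v1 with the custom environment.
import gym
from stable_baselines3.common.vec_env import SubprocVecEnv


def make_env():
    return gym.make("CartPole-v1")  # stand-in for the custom environment


if __name__ == "__main__":
    n_workers = 8  # tune against the number of available CPU cores
    vec_env = SubprocVecEnv([make_env for _ in range(n_workers)])
    obs = vec_env.reset()
```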
I am using reinforcement learning for mathematical optimization, with a PPO2 agent in Google Colab.
With my custom environment, the episode rewards remain zero in TensorBoard. Also, when I add a print statement to print out "true_reward" inside the "ppo2.py" file (as shown in the attached figure), I get nothing but a zero vector.
Because of this, my agent is not learning correctly.
The following things are important to note here:
- … not being collected.
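A quick way to narrow this down (a sketch, with CartPole-v1 as a stand-in for the actual environment) is to step the vectorized env directly and look at the rewards before PPO2 touches them:

```python
# If the printed rewards are already zero, the problem is inside the
# environment; if they are non-zero, the zeros appear during data collection.
import gym
from stable_baselines.common.vec_env import DummyVecEnv

vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1")])  # replace with the custom env
obs = vec_env.reset()
for _ in range(20):
    action = [vec_env.action_space.sample()]
    obs, rewards, dones, infos = vec_env.step(action)
    print(rewards)
```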