[Question] Why so difficult to learn 0? #704
Comments
Hello,
Unfortunately, SAC and TD3 do not support multiprocessing, which is a deal-breaker for me because I am running on a computer with 72 cores.
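For reference, the on-policy algorithms can make use of those cores through `SubprocVecEnv`; a minimal sketch (the env id and the step count below are placeholders):

```python
import multiprocessing

import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env(env_id):
    # each worker process builds its own copy of the environment
    def _init():
        return gym.make(env_id)
    return _init

if __name__ == "__main__":
    n_envs = multiprocessing.cpu_count()  # e.g. 72 on the machine mentioned above
    env = SubprocVecEnv([make_env("Pendulum-v0") for _ in range(n_envs)])
    model = PPO2("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=1_000_000)
```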
Thanks for the pointers here. Setting the entropy coeff to zero did not make any difference in PPO2, A2C, or ACKTR. Increasing the learning rate did help the performance for all 3. However, learning 1 still happens much, much faster. Any idea as to why this might be happening?
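For completeness, those two settings correspond to constructor keywords; a sketch with PPO2 (A2C and ACKTR expose similarly named arguments; the env id and the values below are placeholders):

```python
import gym
from stable_baselines import PPO2

env = gym.make("Pendulum-v0")  # placeholder: any continuous-action env
# ent_coef=0.0 removes the entropy bonus; learning_rate is raised above the default
model = PPO2("MlpPolicy", env, ent_coef=0.0, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=3_000_000)
```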
I don't know what this means. Can you explain it to me?
PPO samples from a distribution to select the next action. This means that even if 0 is the best action as dictated by the policy, there is still a chance of selecting other values. Maybe half of the Gaussian is truncated at the bounds.
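A rough numerical illustration of that truncation (plain numpy, not the library's code; the std of 1.0 is an assumed initial policy standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)
std = 1.0  # assumed initial policy standard deviation

for mu in (0.0, 1.0):
    # sample actions and clip them to the [-1, 1] action bounds
    actions = np.clip(rng.normal(mu, std, size=100_000), -1.0, 1.0)
    at_bound = np.mean(np.abs(actions) == 1.0)
    print(f"mu={mu:+.1f}: clipped mean {actions.mean():+.2f}, "
          f"clipped std {actions.std():.2f}, share exactly at a bound {at_bound:.0%}")
```

With the mean at 1, roughly half of the samples are clipped to exactly 1, whereas with the mean at 0 the samples still spread over the whole interval.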
Thank you for clarifying this. It seems that the right way to think about this is not that it is having a hard time "learning 0" but that it is really good at "learning 1" because of clipping the Gaussian distribution. I'm curious why the documentation says that we should normalize the continuous action space to [-1, 1]. If PPO2 is really starting with a Gaussian distribution with mu 0 and std 1, then this action space normalization will clip about 32% of the actions. It seems that it would be better to clip to [-2, 2] for ~95% or even [-3, 3] for ~99%. Furthermore, why use a Gaussian distribution instead of a Uniform distribution across the entire action space? Uniform would cover the space better...
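Those percentages can be checked directly (a standalone scipy check, not library code):

```python
from scipy.stats import norm

# fraction of a standard Gaussian that falls inside / outside each clipping interval
for bound in (1, 2, 3):
    inside = norm.cdf(bound) - norm.cdf(-bound)
    print(f"[-{bound}, {bound}]: {inside:.1%} inside, {1 - inside:.1%} clipped")
# [-1, 1]: 68.3% inside, 31.7% clipped
# [-2, 2]: 95.4% inside, 4.6% clipped
# [-3, 3]: 99.7% inside, 0.3% clipped
```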
Good point. First, this is general advice: for instance, DDPG and TD3 do not rely on a Gaussian distribution and directly output values in [-1, 1] (because of a tanh).
Uniform may be good for initial exploration (in fact, that is what is used for SAC/TD3 during the warmup phase). However, during the later phase of training, you want to take actions around the mean (the deterministic action) while still exploring. That's why a Gaussian distribution is used here.
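Roughly, that two-phase scheme looks like the following toy sketch (the warmup length, std, and function name are illustrative, not SAC/TD3's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
LOW, HIGH = -1.0, 1.0
WARMUP_STEPS = 1_000  # made-up number, not the library's default

def select_action(step, policy_mean, std=0.1):
    if step < WARMUP_STEPS:
        # early phase: uniform sampling covers the whole action space
        return rng.uniform(LOW, HIGH)
    # later phase: explore locally around the mean (deterministic) action
    return float(np.clip(rng.normal(policy_mean, std), LOW, HIGH))

print(select_action(step=10, policy_mean=0.8))     # uniform exploration
print(select_action(step=5_000, policy_mean=0.8))  # Gaussian around the mean
```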
Thank you for the great response @araffin. I appreciate the insight you bring to using this library and to RL in general. I will test out some of these modifications. I would really like to use SAC since it sounds like a better fit for me, but I can't sacrifice the multiprocessing. Any plans to support this for SAC?
Not really (see #324), because it is meant for robotics and would add too much complexity. However, you can easily create a custom version that supports it (using a for loop to fill the buffer). Closing this issue then.
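For anyone attempting that custom version, the idea reduces to something like this sketch (plain gym, a Python list as a stand-in replay buffer, and random actions in place of SAC's policy; none of these names come from the library):

```python
import gym

n_envs = 4  # arbitrary; could be anything up to the core count
envs = [gym.make("Pendulum-v0") for _ in range(n_envs)]  # placeholder env id
obs = [env.reset() for env in envs]
replay_buffer = []  # stand-in for SAC's replay buffer

for iteration in range(1_000):
    # the "for loop to fill the buffer": one transition per env per iteration
    for i, env in enumerate(envs):
        action = env.action_space.sample()  # replace with the SAC policy's action
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.append((obs[i], action, reward, next_obs, done))
        obs[i] = env.reset() if done else next_obs
    # ...then run the usual SAC gradient updates on samples from replay_buffer
```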
I've been working a lot with environments that have continuous action spaces, and I've noticed some strange behavior: the agents seem to have a very hard time learning the optimal action when the optimal action is zero. To test this, I've created a very simple environment where the agent simply chooses continuous values. The reward is shaped such that the agent is encouraged to choose 1 for the first N values and 0 for the next N values, like so:
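A minimal sketch of such an environment (the class name, the value of N, the observation, and the exact reward shaping here are illustrative assumptions):

```python
import gym
import numpy as np
from gym import spaces

class OnesThenZerosEnv(gym.Env):
    """Toy env: the best action is 1 for the first N steps and 0 for the next N."""

    def __init__(self, n=5):
        self.n = n
        self.t = 0
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self):
        self.t = 0
        return np.array([0.0], dtype=np.float32)

    def step(self, action):
        target = 1.0 if self.t < self.n else 0.0
        reward = -abs(float(action[0]) - target)  # closer to the target => higher reward
        self.t += 1
        done = self.t >= 2 * self.n
        obs = np.array([self.t / (2 * self.n)], dtype=np.float32)
        return obs, reward, done, {}
```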
I ran this with PPO2, A2C, and ACKTR for 3 million steps. Each time, the agent is able to learn 1 for the first N values very quickly, but it seems to have a very hard time learning 0 for the second N values. Here is a graph showing the average action taken over 200 steps for policies trained for 1 million, 2 million, and 3 million steps with PPO2. The black dots are the average and the flat lines are 1 standard deviation away.
The agent does seem to be learning to choose 0 better over time, because the standard deviation shrinks for longer-trained policies, but it is taking MUCH longer than learning 1. I find this very strange. Why is it so difficult for the agent to explore 0?
This is similar to #473, but the answers there don't address my question. For the record, I am using a normalized action space of [-1, 1].
System Info
Describe the characteristics of your environment: