
[Question] Why so difficult to learn 0? #704

Closed
rusu24edward opened this issue Feb 24, 2020 · 7 comments
Labels: question

Comments

@rusu24edward

rusu24edward commented Feb 24, 2020

I've been working a lot with environments that have continuous action spaces, and I've noticed some strange behavior: the agents seem to have a very hard time learning the optimal action when the optimal action is zero. To test this, I've created a very simple environment where the agent simply chooses continuous values. The reward is shaped so that the agent is encouraged to choose 1 for the first N values and 0 for the next N values, like so:

def step(self, action):
    # Split the action vector into the two halves being tested.
    first_N = action[:self.N]
    second_N = action[self.N:]

    first_N_should_be = np.ones(self.N)
    second_N_should_be = np.zeros(self.N)

    # Penalty is the distance from the target values, so the reward is its negative.
    reward = np.linalg.norm(first_N_should_be - first_N) \
           + np.linalg.norm(second_N_should_be - second_N)
    return obs, -reward, done, info

I ran this with PPO2, A2C, and ACKTR for 3 million steps. Each time, the agent is able to learn 1 for the first N values very quickly, but it seems to have a very hard time learning 0 for the second_N values. Here is a graph demonstrating the average action taken over 200 steps for policies trained for 1mil, 2mil, and 3mil steps with PPO2. The black dots are the average and the flat lines are 1 standard deviation away.

[Screenshot: average actions over 200 steps for PPO2 policies trained for 1M, 2M, and 3M steps]

The agent does seem to be learning to choose 0 better over time, because the standard deviation shrinks for longer-trained policies, but it takes MUCH longer than learning 1. I find this very strange. Why is it so difficult for the agent to explore 0?

This is similar to #473, but the answers there don't address my question. For the record, I am using a normalized action space of [-1, 1].

System Info
Describe the characteristic of your environment:

  • Stable baselines 2.8.0 installed with pip
  • Python version: 3.7.4
  • Tensorflow version: 1.14
  • Numpy version: 1.17
@araffin added the question label on Feb 24, 2020
@araffin
Collaborator

araffin commented Feb 24, 2020

Hello,
Did you try with SAC/TD3 too? (they are much more suited for continuous env)
did you also set the entropy coeff to zero? did you try increasing the learning rate?
why don't you use the deterministic policy for evaluation?
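
For concreteness, a minimal sketch of how those two knobs can be set when constructing PPO2; the Pendulum-v0 env and the hyperparameter values here are only placeholders, not recommendations:

import gym
from stable_baselines import PPO2

env = gym.make("Pendulum-v0")  # placeholder continuous-action env
# ent_coef=0.0 removes the entropy bonus; the learning rate value is only an example.
model = PPO2("MlpPolicy", env, ent_coef=0.0, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=100_000)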

@rusu24edward
Author

Did you try with SAC/TD3 too? (they are much more suited for continuous env)

Unfortunately, SAC and TD3 do not support multiprocessing, which is a deal-breaker for me because I am running on a computer with 72 cores.

did you also set the entropy coeff to zero? did you try increasing the learning rate?

Thanks for the pointers here. Setting the entropy coefficient to zero did not make any difference for PPO2, A2C, or ACKTR. Increasing the learning rate did help performance for all three. However, learning 1 still happens much, much faster. Any idea why this might be happening?

why don't you use the deterministic policy for evaluation?

I don't know what this means. Can you explain it to me?

@smorad

smorad commented Mar 2, 2020

why don't you use the deterministic policy for evaluation?

PPO samples from a distribution to select the next action. This means that even if 0 is the best action as dictated by the policy, there is still a chance of selecting other values. Maybe half the Gaussian is truncated at the bounds [-1, 1]? If half the probability mass was >1 but all samples were rounded to 1, you may see it 'converge' 2x as fast to 1. Just a thought.
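
To make that concrete, here is a quick numpy check of the intuition (a sketch, not taken from the library internals): with a roughly N(0, 1) untrained policy, a large chunk of samples gets clipped onto the boundary at 1, while nothing comparable concentrates mass at 0.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
clipped = np.clip(samples, -1.0, 1.0)

# About 16% of samples land exactly on each boundary after clipping...
print("fraction exactly at  1:", (clipped == 1.0).mean())   # ~0.16
print("fraction exactly at -1:", (clipped == -1.0).mean())  # ~0.16
# ...while the mass around 0 stays spread out.
print("fraction within +/- 0.05 of 0:", (np.abs(clipped) < 0.05).mean())  # ~0.04

For evaluation, Stable-Baselines exposes model.predict(obs, deterministic=True), which returns the mean of the distribution instead of a sample and therefore sidesteps this sampling noise.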

@rusu24edward
Author

PPO samples from a distribution to select the next action. This means that even if 0 is the best action as dictated by the policy, there is still a chance of selecting other values. Maybe half the Gaussian is truncated at the bounds [-1, 1]? If half the probability mass was >1 but all samples were rounded to 1, you may see it 'converge' 2x as fast to 1. Just a thought.

Thank you for clarifying this. It seems that the right way to think about this is not that the agent has a hard time "learning 0" but that it is really good at "learning 1" because of the clipping of the Gaussian distribution.

I'm curious why the documentation says that we should normalize the continuous action space to [-1, 1]. If PPO2 really starts with a Gaussian distribution with mu 0 and std 1, then this action space normalization will clip about 32% of the actions. It seems that it would be better to clip at [-2, 2] to keep 95%, or even at [-3, 3] to keep 99%. Furthermore, why use a Gaussian distribution instead of a uniform distribution across the entire action space? Uniform would cover the space better...
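
For what it's worth, those percentages can be checked directly from the standard normal CDF (a quick sketch assuming an untrained policy that is roughly N(0, 1) per action dimension):

from math import erf, sqrt

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for bound in (1.0, 2.0, 3.0):
    clipped_fraction = 2.0 * (1.0 - std_normal_cdf(bound))
    print(f"[-{bound:g}, {bound:g}]: {clipped_fraction:.1%} of samples clipped")

# [-1, 1]: 31.7% of samples clipped
# [-2, 2]: 4.6% of samples clipped
# [-3, 3]: 0.3% of samples clipped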

@araffin
Collaborator

araffin commented Mar 7, 2020

we should normalize the continuous action space to [-1, 1]
If PPO2 really starts with a Gaussian distribution with mu 0 and std 1, then this action space normalization will clip about 32% of the actions.

Good point. First, this is general advice; for instance, DDPG and TD3 do not rely on a Gaussian distribution and directly output values in [-1, 1] (because of a tanh).
Then, it is true that many values may be clipped at the beginning of training, but the standard deviation usually decreases quickly. As another solution, you can initialize the standard deviation with a different value (as is done in Spinning Up). In Stable-Baselines v3, it will be easier to change.
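
As an illustration of the "different initial standard deviation" idea, the v3-style code would look something like the sketch below; this assumes SB3's log_std_init policy keyword, which does not exist in Stable-Baselines 2:

from stable_baselines3 import PPO

# Sketch only: log_std_init sets the initial log standard deviation of the Gaussian
# policy, so a negative value starts training with a narrower (less clipped) distribution.
model = PPO("MlpPolicy", "Pendulum-v0", policy_kwargs=dict(log_std_init=-1.0), verbose=1)
model.learn(total_timesteps=100_000)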

Furthermore, why use a Gaussian distribution instead of a uniform distribution across the entire action space? Uniform would cover the space better...

Uniform may be good for initial exploration (in fact, that is what is used for SAC/TD3 during the warmup phase). However, during the later phases of training, you want to take actions around the mean (the deterministic action) while still exploring. That's why a Gaussian distribution is used here.
An even better solution, which removes the clipping entirely, is to use a squashed Gaussian; that's what SAC uses.
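
A minimal numpy sketch of the squashing step (not SAC's actual implementation): sample from the Gaussian as usual, then pass the sample through tanh so the action always lands strictly inside (-1, 1) and nothing ever needs to be clipped.

import numpy as np

rng = np.random.default_rng(0)

def squashed_gaussian_action(mean, std):
    # Unbounded Gaussian sample...
    u = rng.normal(loc=mean, scale=std)
    # ...squashed into (-1, 1), so no probability mass piles up on the bounds.
    return np.tanh(u)

print(squashed_gaussian_action(mean=0.0, std=1.0))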

@rusu24edward
Author

Thank you for the great response @araffin. I appreciate the insight you bring to using this library and to RL in general. I will test out some of these modifications. I would really like to use SAC since it sounds like a better fit for me, but I can't sacrifice the multiprocessing. Any plans to support this for SAC?

@araffin
Collaborator

araffin commented Mar 7, 2020

Any plans to support this for SAC?

Not really (see #324), because it is meant for robotics and would add too much complexity. However, you can easily create a custom version that supports it (using a for loop to fill the buffer).
In v3, it should be easier to create such a version though...
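
For the record, the "for loop to fill the buffer" pattern would look roughly like the sketch below. This is not Stable-Baselines' actual SAC code; it is just a generic illustration that steps a SubprocVecEnv and appends each worker's transition to a shared buffer, with random actions standing in for the SAC policy:

import gym
import numpy as np
from stable_baselines.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    n_envs = 4
    env = SubprocVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(n_envs)])

    replay_buffer = []  # stand-in for SAC's replay buffer
    obs = env.reset()
    for _ in range(1000):
        # Random actions stand in for sampling from the SAC policy.
        actions = np.array([env.action_space.sample() for _ in range(n_envs)])
        new_obs, rewards, dones, infos = env.step(actions)
        # Store one transition per worker; done envs are auto-reset by the VecEnv.
        for i in range(n_envs):
            replay_buffer.append((obs[i], actions[i], rewards[i], new_obs[i], dones[i]))
        obs = new_obs
    env.close()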

Closing this issue then.
