
[Question] Why so difficult to learn 0? #704

Closed
rusu24edward opened this issue Feb 24, 2020 · 7 comments
Labels: question

Comments

@rusu24edward

rusu24edward commented Feb 24, 2020

I've been working a lot with environments that have continuous action spaces, and I've noticed some strange behavior: the agents seem to have a very hard time learning the optimal action when the optimal action is zero. To test this, I've created a very simple environment where the agent simply chooses continuous values. The reward is shaped so that the agent is encouraged to choose 1 for the first N values and 0 for the next N values, like so:

def step(self, action):
    # Split the action vector into the two halves being tested.
    first_N = action[:self.N]
    second_N = action[self.N:]

    first_N_should_be = np.ones(self.N)
    second_N_should_be = np.zeros(self.N)

    # Penalty is the distance from the target values, so the reward is its negative.
    reward = np.linalg.norm(first_N_should_be - first_N) \
           + np.linalg.norm(second_N_should_be - second_N)
    return obs, -reward, done, info

I ran this with PPO2, A2C, and ACKTR for 3 million steps. Each time, the agent is able to learn 1 for the first N values very quickly, but it seems to have a very hard time learning 0 for the second_N values. Here is a graph demonstrating the average action taken over 200 steps for policies trained for 1mil, 2mil, and 3mil steps with PPO2. The black dots are the average and the flat lines are 1 standard deviation away.

[Screenshot: average actions over 200 steps for PPO2 policies trained for 1M, 2M, and 3M steps]

The agent does seem to be learning to choose 0 better over time, because the standard deviation shrinks for longer-trained policies, but it takes MUCH longer than learning 1. I find this very strange. Why is it so difficult for the agent to explore 0?

This is similar to #473, but the answers there don't address my question. For the record, I am using a normalized action space of [-1, 1].

System Info
Describe the characteristic of your environment:

  • Stable baselines 2.8.0 installed with pip
  • Python version: 3.7.4
  • Tensorflow version: 1.14
  • Numpy version: 1.17
@araffin added the question label on Feb 24, 2020
@araffin
Collaborator

araffin commented Feb 24, 2020

Hello,
Did you try with SAC/TD3 too? (they are much more suited for continuous env)
did you also set the entropy coeff to zero? did you try increasing the learning rate?
why don't you use the deterministic policy for evaluation?
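
For concreteness, a minimal sketch of how those two knobs can be set when constructing PPO2; the Pendulum-v0 env and the hyperparameter values here are only placeholders, not recommendations:

import gym
from stable_baselines import PPO2

env = gym.make("Pendulum-v0")  # placeholder continuous-action env
# ent_coef=0.0 removes the entropy bonus; the learning rate value is only an example.
model = PPO2("MlpPolicy", env, ent_coef=0.0, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=100_000)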

@rusu24edward
Author

Did you try with SAC/TD3 too? (they are much more suited for continuous env)

Unfortunately, SAC and TD3 do not support multiprocessing, which is a deal-breaker for me because I am running on a computer with 72 cores.

did you also set the entropy coeff to zero? did you try increasing the learning rate?

Thanks for the pointers here. Setting the entropy coefficient to zero did not make any difference for PPO2, A2C, or ACKTR. Increasing the learning rate did help performance for all three. However, learning 1 still happens much, much faster. Any idea why this might be happening?

why don't you use the deterministic policy for evaluation?

I don't know what this means. Can you explain it to me?

@smorad

smorad commented Mar 2, 2020

why don't you use the deterministic policy for evaluation?

PPO samples from a distribution to select the next action. This means that even if 0 is the best action as dictated by the policy, there is still a chance of selecting other values. Maybe half the Gaussian is truncated at the bounds [-1, 1]? If half the probability mass was >1 but all samples were rounded to 1, you may see it 'converge' 2x as fast to 1. Just a thought.
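
To make that concrete, here is a quick numpy check of the intuition (a sketch, not taken from the library internals): with a roughly N(0, 1) untrained policy, a large chunk of samples gets clipped onto the boundary at 1, while nothing comparable concentrates mass at 0.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
clipped = np.clip(samples, -1.0, 1.0)

# About 16% of samples land exactly on each boundary after clipping...
print("fraction exactly at  1:", (clipped == 1.0).mean())   # ~0.16
print("fraction exactly at -1:", (clipped == -1.0).mean())  # ~0.16
# ...while the mass around 0 stays spread out.
print("fraction within +/- 0.05 of 0:", (np.abs(clipped) < 0.05).mean())  # ~0.04

For evaluation, Stable-Baselines exposes model.predict(obs, deterministic=True), which returns the mean of the distribution instead of a sample and therefore sidesteps this sampling noise.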

@rusu24edward
Author

PPO samples from a distribution to select the next action. This means that even if 0 is the best action as dictated by the policy, there is still a chance of selecting other values. Maybe half the Gaussian is truncated at the bounds [-1, 1]? If half the probability mass was >1 but all samples were rounded to 1, you may see it 'converge' 2x as fast to 1. Just a thought.

Thank you for clarifying this. It seems that the right way to think about this is not that the agent has a hard time "learning 0" but that it is really good at "learning 1" because of the clipping of the Gaussian distribution.

I'm curious why the documentation says that we should normalize the continuous action space to [-1, 1]. If PPO2 really starts with a Gaussian distribution with mu 0 and std 1, then this action space normalization will clip about 32% of the actions. It seems that it would be better to clip at [-2, 2] to keep 95%, or even at [-3, 3] to keep 99%. Furthermore, why use a Gaussian distribution instead of a uniform distribution across the entire action space? Uniform would cover the space better...
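
For what it's worth, those percentages can be checked directly from the standard normal CDF (a quick sketch assuming an untrained policy that is roughly N(0, 1) per action dimension):

from math import erf, sqrt

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for bound in (1.0, 2.0, 3.0):
    clipped_fraction = 2.0 * (1.0 - std_normal_cdf(bound))
    print(f"[-{bound:g}, {bound:g}]: {clipped_fraction:.1%} of samples clipped")

# [-1, 1]: 31.7% of samples clipped
# [-2, 2]: 4.6% of samples clipped
# [-3, 3]: 0.3% of samples clipped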

@araffin
Collaborator

araffin commented Mar 7, 2020

we should normalize the continuous action space to [-1, 1]
If PPO2 really starts with a Gaussian distribution with mu 0 and std 1, then this action space normalization will clip about 32% of the actions.

Good point. First, this is general advice; for instance, DDPG and TD3 do not rely on a Gaussian distribution and directly output values in [-1, 1] (because of a tanh).
Then, it is true that many values may be clipped at the beginning of training, but the standard deviation usually decreases quickly. As another solution, you can initialize the standard deviation with a different value (as is done in Spinning Up). In Stable-Baselines v3, it will be easier to change.
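
As an illustration of the "different initial standard deviation" idea, the v3-style code would look something like the sketch below; this assumes SB3's log_std_init policy keyword, which does not exist in Stable-Baselines 2:

from stable_baselines3 import PPO

# Sketch only: log_std_init sets the initial log standard deviation of the Gaussian
# policy, so a negative value starts training with a narrower (less clipped) distribution.
model = PPO("MlpPolicy", "Pendulum-v0", policy_kwargs=dict(log_std_init=-1.0), verbose=1)
model.learn(total_timesteps=100_000)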

Furthermore, why use a Gaussian distribution instead of a uniform distribution across the entire action space? Uniform would cover the space better...

Uniform may be good for initial exploration (in fact, that is what is used for SAC/TD3 during the warmup phase). However, during the later phases of training, you want to take actions around the mean (the deterministic action) while still exploring. That's why a Gaussian distribution is used here.
An even better solution, which removes the clipping entirely, is to use a squashed Gaussian; that's what SAC uses.
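
A minimal numpy sketch of the squashing step (not SAC's actual implementation): sample from the Gaussian as usual, then pass the sample through tanh so the action always lands strictly inside (-1, 1) and nothing ever needs to be clipped.

import numpy as np

rng = np.random.default_rng(0)

def squashed_gaussian_action(mean, std):
    # Unbounded Gaussian sample...
    u = rng.normal(loc=mean, scale=std)
    # ...squashed into (-1, 1), so no probability mass piles up on the bounds.
    return np.tanh(u)

print(squashed_gaussian_action(mean=0.0, std=1.0))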

@rusu24edward
Author

Thank you for the great response @araffin. I appreciate the insight you bring to using this library and to RL in general. I will test out some of these modifications. I would really like to use SAC since it sounds like a better fit for me, but I can't sacrifice the multiprocessing. Any plans to support this for SAC?

@araffin
Collaborator

araffin commented Mar 7, 2020

Any plans to support this for SAC?

Not really (see #324), because it is meant for robotics and would add too much complexity. However, you can easily create a custom version that supports it (using a for loop to fill the buffer).
In v3, it should be easier to create such a version though...
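
For the record, the "for loop to fill the buffer" pattern would look roughly like the sketch below. This is not Stable-Baselines' actual SAC code; it is just a generic illustration that steps a SubprocVecEnv and appends each worker's transition to a shared buffer, with random actions standing in for the SAC policy:

import gym
import numpy as np
from stable_baselines.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    n_envs = 4
    env = SubprocVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(n_envs)])

    replay_buffer = []  # stand-in for SAC's replay buffer
    obs = env.reset()
    for _ in range(1000):
        # Random actions stand in for sampling from the SAC policy.
        actions = np.array([env.action_space.sample() for _ in range(n_envs)])
        new_obs, rewards, dones, infos = env.step(actions)
        # Store one transition per worker; done envs are auto-reset by the VecEnv.
        for i in range(n_envs):
            replay_buffer.append((obs[i], actions[i], rewards[i], new_obs[i], dones[i]))
        obs = new_obs
    env.close()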

Closing this issue then.
