
Reward shaping not removed in evaluation in CarRacing-From-Pixels-PPO #3

Closed
lerrytang opened this issue Mar 17, 2020 · 5 comments

@lerrytang

Hi,

The figure and log in the README show scores > 1000, which, given CarRacing's reward design, should not be possible.
It turns out that the reward shaping in Wrapper.step() is not removed during evaluation, and that leads to inflated results.
After commenting out the relevant lines, I got an average score of 820 over 100 episodes.
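For reference, a minimal unshaped evaluation loop consistent with that measurement might look like the sketch below; agent.select_action and env_wrapper are placeholder names, not the repo's actual API.

    import numpy as np

    # hypothetical loop: average the raw (unshaped) return over 100 episodes
    scores = []
    for episode in range(100):
        state = env_wrapper.reset()                # placeholder wrapper object
        episode_reward = 0.0
        while True:
            action = agent.select_action(state)    # placeholder policy call
            state, reward, done, die = env_wrapper.step(action)
            episode_reward += reward
            if done or die:
                break
        scores.append(episode_reward)
    print('average score over 100 episodes:', np.mean(scores))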

@Rafael1s
Owner

Thanks for your note.
Could you point out which lines should be commented out, or share your version of Wrapper.step()?

@lerrytang
Author

Hi,
The following snippet is what I used for evaluation; let me know if this makes sense to you :)

    def step(self, action):
        total_reward = 0
        for i in range(action_repeat):
            img_rgb, reward, die, _ = env.step(action)
            # reward shaping from training, disabled for evaluation:
            # don't penalize "die state"
            # if die:
            #     reward += 100
            # green penalty
            # if np.mean(img_rgb[:, :, 1]) > 185.0:
            #     reward -= 0.05
            total_reward += reward
            # early termination ("if no reward recently, end the episode")
            # is likewise disabled for evaluation
            # done = True if self.av_r(reward) <= -0.1 else False
            done = False
            if done or die:
                break
        img_gray = rgb2gray(img_rgb)
        self.stack.pop(0)
        self.stack.append(img_gray)
        assert len(self.stack) == img_stack
        return np.array(self.stack), total_reward, done, die

@Rafael1s
Owner

Hi,
I cannot agree with your version. For example, where is your "green penalty"? You need to penalize the car for driving onto the green field. Possibly the green threshold should be lower than 185, or the penalty should be tuned more carefully than -0.05.

@lerrytang
Author

Let's see if the following points describe my position better.

  1. There are two versions of Wrapper.step(): one for training and one for evaluation (see the sketch after this comment).
  2. You can add whatever reward shaping you like in the training version, e.g., a penalty for driving onto the grass.
  3. You should not add reward shaping in the evaluation version. E.g., CarRacing is considered solved when the average reward is > 900, but it is not fair if you add 100 when die == True, or end the episode early because you notice the car is not doing well, right?

The code snippet I used was for evaluation.
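One way to make this split concrete is a single wrapper whose shaping is gated by a training flag. The sketch below is only an illustration built from the snippets in this thread (the flag, the action_repeat default, and the return shape are assumptions, not the repo's actual code), and it omits the frame stacking for brevity:

    import numpy as np

    class Wrapper:
        # sketch: one step() for both modes, shaping gated by a flag
        def __init__(self, env, action_repeat=8, training=True):
            self.env = env
            self.action_repeat = action_repeat
            self.training = training          # apply shaping only when True

        def step(self, action):
            total_reward = 0
            for _ in range(self.action_repeat):
                img_rgb, reward, die, info = self.env.step(action)
                if self.training:
                    # training-only shaping, mirroring the repo's version
                    if die:
                        reward += 100         # offset the env's terminal -100
                    if np.mean(img_rgb[:, :, 1]) > 185.0:
                        reward -= 0.05        # penalty for driving on grass
                total_reward += reward
                if die:
                    break
            # in evaluation mode this returns the unshaped return,
            # directly comparable to the 900-point "solved" threshold
            return img_rgb, total_reward, die, info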

@Rafael1s
Owner

@lerrytang
Let us look at the OpenAI CarRacing environment code
https://github.com/openai/gym/blob/master/gym/envs/box2d/car_racing.py
lines 337-339

If the car drives off the field, the reward is penalized by -100.
However, if the track is completed (the absolutely successful case), the reward is ALSO penalized by -100.
So, for fairness, we restore the reward with +100 in Wrapper.step().
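Restated as code, the restoration described here is exactly the training-time line that the evaluation snippet earlier in the thread comments out:

    # training wrapper only: give back the env's terminal -100 so that
    # finishing the track is not punished like driving off the field
    if die:
        reward += 100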
