Bug in Reward function #72

hepengli · 2021-05-31T02:40:50Z

I found a bug in the reward function in the file "./env/starcraft2/starcraft2.py", in the function "reward_battle(self)", line 729: when the enemies heal or regenerate shields, the allies receive a reward.

On line 729, the argument "delta_enemy" is negative when the enemy heals or regenerates shields. However, the line applies abs() to "delta_enemy + delta_deaths", which turns that negative value into a positive reward. As a result, the allies are rewarded when the enemy heals.

One consequence of this problem is that the allies may only learn to hurt the enemies but never kill them so that they can receive rewards when the enemies heal afterwards. In that case, a policy can learn to increase rewards but never win the game.
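
To illustrate the effect, here is a minimal sketch. It assumes the line has the form described above (abs() applied to the sum); the function name and the numbers are hypothetical and not copied from starcraft2.py.

```python
def only_positive_reward_abs(delta_enemy, delta_deaths):
    """Hypothetical stand-in for the reward_only_positive branch as described:
    abs() applied to delta_enemy + delta_deaths."""
    return abs(delta_enemy + delta_deaths)


# Enemy loses 10 health/shield: delta_enemy is positive, so the reward is 10.
damage_step = only_positive_reward_abs(delta_enemy=10.0, delta_deaths=0.0)

# Enemy regenerates 10 shield: delta_enemy is negative, but abs() still yields 10.
heal_step = only_positive_reward_abs(delta_enemy=-10.0, delta_deaths=0.0)

print(damage_step, heal_step)
```

Both steps produce the same reward, so under this form the agent cannot tell dealing damage apart from letting the enemy recover.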

douglasrizzo · 2021-07-09T15:31:49Z

Just to be clear, the problem you mention happens when delta_enemy is negative with a larger magnitude than delta_deaths, which would mean that agents are rewarded by injuring enemies, while not killing them. In this scenario, delta_enemy + delta_deaths would be negative, but the absolute of that would be positive.

Since self.reward_only_positive means we want to ignore negative rewards, I believe the best solution is to take the maximum of 0 and delta_enemy + delta_deaths. It would behave more like the reward when self.reward_only_positive == False, while never being negative and ignoring ally damage and deaths.

if self.reward_only_positive:
    reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
else:
    reward = delta_enemy + delta_deaths - delta_ally
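
As a rough check of how the proposed branch behaves, the sketch below wraps the snippet above in a standalone function. The function name, the flag passed as a parameter instead of self.reward_only_positive, and the input values are assumptions for illustration only.

```python
def battle_reward(delta_enemy, delta_deaths, delta_ally, reward_only_positive=True):
    """Standalone version of the proposed snippet (hypothetical wrapper)."""
    if reward_only_positive:
        # Clip at zero so enemy healing never produces a positive reward.
        return max(0.0, delta_enemy + delta_deaths)
    return delta_enemy + delta_deaths - delta_ally


# Enemy heals 10 and nothing dies: clipped to zero.
heal_only = battle_reward(delta_enemy=-10.0, delta_deaths=0.0, delta_ally=0.0)

# Enemy takes 10 damage while allies take 5: ally damage is ignored.
damage_only = battle_reward(delta_enemy=10.0, delta_deaths=0.0, delta_ally=5.0)

# Same step with reward_only_positive disabled: ally damage is subtracted.
full_signal = battle_reward(10.0, 0.0, 5.0, reward_only_positive=False)

print(heal_only, damage_only, full_signal)
```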

hepengli · 2021-07-09T15:51:08Z

Yes! This happens especially in maps with Protoss units, which can regenerate shields. In the map "3s5z_vs_3s6z" I found that the allies can learn a strategy that increases the reward without winning the game. Specifically, the allies learn to injure the enemies a little, immediately run away and hide in a corner while waiting for the enemies to recover, and then repeat. However, the problem is solved by modifying the reward function to this:

if self.reward_only_positive:
    reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
else:
    reward = delta_enemy + delta_deaths - delta_ally
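
The hit-and-run pattern described above can be sketched as a toy rollout. This is not SMAC code: the per-step values, the number of cycles, and the helper name are all made up for illustration, and deaths are assumed to be zero throughout.

```python
def episode_return(step_deltas, shape):
    """Sum the shaped reward over a list of per-step (delta_enemy + delta_deaths) values."""
    return sum(shape(delta) for delta in step_deltas)


# One cycle: deal 10 damage, then retreat while the enemy regenerates 10 shield.
cycle = [10.0, -10.0]
step_deltas = cycle * 5  # repeat the cycle five times, never killing anything

abs_return = episode_return(step_deltas, abs)
max_return = episode_return(step_deltas, lambda delta: max(0.0, delta))

print(abs_return, max_return)
```

In this toy rollout the abs() form pays for both the damage step and the recovery step of every cycle, while the max(0, ...) form pays only for the damage steps.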

douglasrizzo · 2021-07-09T16:05:14Z

I sent a quick PR with this tiny change. Check if it solves the issues you have in your experiments.

hepengli · 2021-07-09T16:21:58Z

Thanks! I think this will do. I will check this again and get back to you soon.

samvelyan · 2021-07-19T15:16:37Z

Thanks both for pointing this out and for sending the PR #76. We're going to see how to best integrate this fix in the upcoming SMAC versions to avoid confusion. One issue we've noticed is that some people compare results using different SMAC/StarCraftII versions and report unfair comparisons between methods in their work.

douglasrizzo · 2021-07-28T03:40:41Z

I have just watched a similar behavior happen in maps which have Medivacs (MMM and MMM2). They have a healing power and my agents have learned to wait for the Medivac to heal the last unit before killing it, sometimes at the expense of the match.

hepengli · 2021-07-28T03:53:20Z

Yes! I've also noticed this happening in MMM and MMM2. Your solution of changing the only-positive reward to "reward = max(0, delta_enemy + delta_deaths)" fixes this issue.

xihuai18 · 2024-07-08T07:23:22Z

Why not merge #76?

samvelyan · 2024-07-08T09:05:53Z

Given how much the benchmark has been used by the community, fixing this issue now will result in unfair comparisons with existing work. Therefore, we will not merge it with the main branch in this repo.

If you really want to use that particular version of SMAC, you are welcome to use the branch of #76: https://github.com/douglasrizzo/smac/tree/patch-1. But you must make it clear that this is not the standard version of the benchmark when presenting those results.

Lastly, this issue is resolved in SMACv2, the second version of the benchmark which I encourage you to use instead: https://github.com/oxwhirl/smacv2.

hepengli changed the title ~~Reward for Protos has some problem~~ Bug in Reward function Jun 28, 2021

douglasrizzo added a commit to douglasrizzo/smac that referenced this issue Jul 9, 2021

Addresses oxwhirl#72

8042fdc

douglasrizzo mentioned this issue Jul 9, 2021

Prevent positive reward for allowing enemies to heal #76

Open

samvelyan added a commit to benellis3/smac that referenced this issue Apr 19, 2022

Fixed the reward hacking bug

8ab0b54

See oxwhirl#72 for more details.

samvelyan mentioned this issue Apr 19, 2022

Fixed the reward hacking bug benellis3/smac#7

Merged
