Are you actually using the learned intrinsic reward for the agent? #9

Open
ferreirafabio opened this issue Feb 20, 2021 · 6 comments

@ferreirafabio

ferreirafabio commented Feb 20, 2021

Hi,

I can only see that you optimize the intrinsic loss in your code. Can you point me to the line where you add the intrinsic rewards to the actual environment/extrinsic rewards?

In some areas of your code I can see comments like
# total reward = int reward
which would, according to the original paper, be wrong, no?
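Based on the paper, what I would have expected is roughly the following (a schematic illustration with made-up numbers and names, not your code):

```python
# schematic illustration of what I expected (values are made up):
extrinsic_reward = 1.0        # reward returned by the environment
intrinsic_reward = 0.37       # eta/2 * forward-model prediction error
total_reward = extrinsic_reward + intrinsic_reward   # r_t = r_t^e + r_t^i
print(total_reward)           # 1.37 -- this total is what the agent should be trained on
```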

Thank you.

@ruoshiliu

I'm also new to the repo, but here the loss is composed of both the intrinsic and extrinsic rewards:

loss = (actor_loss + 0.5 * critic_loss - 0.001 * entropy) + forward_loss + inverse_loss
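For context, my reading of the individual terms (my own annotation, not comments from the repo, so please correct me if I mislabel anything):

```python
# the same line, annotated with my interpretation of each term:
loss = (
    actor_loss            # policy-gradient term, from the advantages of the rollout
    + 0.5 * critic_loss   # value-function regression on the returns
    - 0.001 * entropy     # entropy bonus for exploration
    + forward_loss        # ICM forward model: predict phi(s_{t+1}) from phi(s_t) and a_t
    + inverse_loss        # ICM inverse model: predict a_t from phi(s_t) and phi(s_{t+1})
)
```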

@ferreirafabio

ferreirafabio commented Mar 3, 2021

Thanks @ruoshiliu. Yes, I saw the loss. But in addition to optimizing the ICM loss, you also need to feed the resulting intrinsic reward (the forward model's prediction error) to the agent as part of its reward, as stated in the paper. Only optimizing the loss is not equivalent to actually using the intrinsic reward for the policy update.
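To make it concrete, here is a minimal sketch of what I mean (hypothetical names and a made-up eta, PyTorch-style; not a claim about how your repo is structured):

```python
import torch

def icm_total_reward(phi_next, pred_phi_next, extrinsic_reward, eta=0.01):
    """Combine the ICM intrinsic reward with the environment reward.

    phi_next:         features of s_{t+1} from the embedding network, shape [T, D]
    pred_phi_next:    forward-model prediction f(phi(s_t), a_t),      shape [T, D]
    extrinsic_reward: rewards returned by the environment,            shape [T]
    """
    # r_t^i = eta/2 * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2, detached so it is
    # treated as a scalar reward signal and not backpropagated through the policy loss
    intrinsic_reward = 0.5 * eta * (pred_phi_next - phi_next).pow(2).sum(dim=-1).detach()
    # r_t = r_t^i + r_t^e -- the returns/advantages that feed actor_loss and
    # critic_loss should be computed from this total, not from extrinsic_reward alone
    return extrinsic_reward + intrinsic_reward
```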

@ferreirafabio ferreirafabio changed the title Are you optimizing over the sum of extrinsic and intrinsic rewards? Are you actually using the learned intrinsic reward for the agent? Mar 3, 2021
@ruoshiliu

ruoshiliu commented Mar 4, 2021

@ferreirafabio What do you mean by "use the intrinsic rewards"? Can you point out which section in the paper states that?

@ferreirafabio

ferreirafabio commented Mar 4, 2021

By that I mean reward = extrinsic reward + intrinsic reward. From the paper:

[Screenshot from the paper: the agent's reward is the sum of the intrinsic and the (optional) extrinsic reward, r_t = r_t^i + r_t^e]

I now realize that the paper says the extrinsic reward can be optional. I'm wondering what is "usually" done (with or without the extrinsic reward) when others use ICM as a baseline.

@ruoshiliu

ruoshiliu commented Mar 4, 2021

Thank you for the clarification. Let me make sure I understand your question. What you are saying is that the code (referenced above) tries to minimize the loss function by maximizing the extrinsic reward and minimizing the intrinsic reward, whereas the correct implementation should reflect equation (7) below.

In other words, the correct implementation should find the policy π that maximizes the sum of intrinsic and extrinsic rewards, together with inverse- and forward-model parameters that minimize L_I and L_F.

Did I interpret your question correctly?

[Screenshot of equation (7) from the paper]
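For readability, here is my transcription of equation (7) (reconstructed from the paper, so please double-check against the original):

```latex
\min_{\theta_P, \theta_I, \theta_F}
  \left[ -\lambda \, \mathbb{E}_{\pi(s_t;\theta_P)}\!\left[\textstyle\sum_t r_t\right]
         + (1-\beta)\, L_I + \beta\, L_F \right]
```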

@ferreirafabio

Yes
