State-action value function #274
Comments
Hi @paLeziart! Could you describe what you're trying to accomplish with this? Is this so that gradients of the critic affect the actor encoder, but not the other way around? I.e. you want a separate-weights actor-critic, but not entirely separate? Sorry, I'm not quite sure what this architecture is doing. Also, just in case, there's a parameter [...]
Hi @alex-petrenko! Thank you for your answer! Sorry, the confusion might be on my side. From what I understand, the goal of the critic is to estimate the value function, which in the case of PPO is the advantage function, i.e. how "good" an action is depending on the current state. Then, considering that the architecture of the network is [...], I guess I don't understand well the behaviour of that linear critic and how it is connected to the rest. I am aware of the [...]

Sorry to bother you, I guess it's a simple thing I am not seeing and the critic can actually properly estimate the advantage, otherwise this whole thing would not work for the provided examples.
Not exactly: the "value function" is the estimate of the discounted return from the current moment to the end of the episode.
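For added context (these definitions are not from the thread itself, just the standard textbook formulation), the quantities being discussed are:

```latex
% State value function: expected discounted return when following policy \pi from state s
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s\right]

% State-action value function: same quantity, but conditioned on the first action a
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\, a_{0} = a\right]

% Advantage: how much better taking action a is than the policy's average behaviour in s
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
```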
I think you misinterpreted some of the code. The architecture should be the following.

For shared-weights actor-critic (--actor_critic_share_weights=True):

[architecture diagram]

For separate actor-critic (--actor_critic_share_weights=False):

[architecture diagram]
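As a rough illustration of the two layouts described above (a minimal PyTorch sketch, not Sample Factory's actual classes; the module and argument names are invented):

```python
import torch
from torch import nn


class SharedWeightsActorCritic(nn.Module):
    """Shared-weights layout: one encoder/core, with a small actor head and a
    linear critic head both reading the same features."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor_head = nn.Linear(hidden, num_actions)  # action distribution parameters
        self.critic_head = nn.Linear(hidden, 1)           # scalar V(s)

    def forward(self, obs: torch.Tensor):
        features = self.encoder(obs)
        return self.actor_head(features), self.critic_head(features)


class SeparateActorCritic(nn.Module):
    """Separate layout: the actor and the critic each have their own encoder,
    so critic gradients never touch the actor's weights."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.actor_encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.critic_encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actor_head = nn.Linear(hidden, num_actions)
        self.critic_head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor):
        action_logits = self.actor_head(self.actor_encoder(obs))
        values = self.critic_head(self.critic_encoder(obs))
        return action_logits, values
```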
You are right to note that in neither scheme does the critic have access to the actions. This is by design. The critic represents the state-based value function (aka the "value function", or V(s)).

I don't have a good intuition about what happens if you add the currently sampled actions to the state. The mathematical formulation requires the value function to predict the value given the environment state, and the actions taken by the agent are not a part of this state. Unless I'm missing something, you probably don't even want that.

If you actually want to do that (which I think you do not), you would have to add the current actions to the state in a rather hacky way. You'd first need to sample the actions during the forward pass (when your agent decides what to do in the environment), and then during training you would need to add the same actions to the state to properly estimate the gradients. Again, I would not advise you to do that.
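For completeness, a Q(s, a)-style critic that does see the sampled actions would look roughly like the sketch below. This is a generic illustration of the hacky approach described above, not something the library provides; the class name and arguments are made up:

```python
import torch
from torch import nn


class ActionConditionedCritic(nn.Module):
    """Critic that sees the state features and the sampled actions, i.e. a
    Q(s, a)-style head instead of the usual V(s) head."""

    def __init__(self, feature_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state_features: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # During the rollout the actions are sampled from the policy first and
        # then concatenated with the state features. During training the *same*
        # stored actions must be fed back in, otherwise the value estimates used
        # for the advantages are inconsistent with the collected experience.
        return self.net(torch.cat([state_features, actions.float()], dim=-1))
```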
The advantages are calculated here given the value estimate and bootstrapped discounted return:
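The referenced code is not reproduced in this thread; as a sketch of what such a computation typically looks like (standard GAE-style advantage estimation, which may differ in details from the actual implementation):

```python
import torch


def compute_advantages(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, values, dones: tensors of shape [T]; last_value: bootstrap value V(s_T)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t].float()
        # TD error: bootstrapped one-step return minus the current value estimate.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
    returns = advantages + values  # regression targets for the value loss
    return advantages, returns
```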
As you can see, there is no need to add actions to the state. Let me know if this is a satisfactory explanation; there is a good chance I didn't fully understand your idea. Good luck!
@alex-petrenko Thank you for your detailed answer! I was indeed a bit confused about the shared/not-shared pipelines and what the critic was trying to learn.
Glad I could help!
Hello,
I had a quick question about the form of the value function. Right now, by default, it is a state value function with a linear layer that receives the output of the decoder. I was wondering if it would be possible to get a state-action value function without undesired side effects. I tried an implementation with a custom model and it seems to run fine, but I'm not sure what is happening under the hood with the subclasses, so I was wondering whether it could cause issues that I have not seen yet.
I tried something as follows. Most of `CustomActorCriticSharedWeights` is a copy-paste of `ActorCriticSharedWeights`, with minor changes. The network is a simple two-layer MLP of size 128. The gist of it is that the core now also outputs its input in its forward function, `torch.cat((self.core_head(head_output), head_output), dim=1)`, i.e. the state and action concatenated. That way the value function receives the state-action pair, and I just give the action part to the action parameterization, `self.action_parameterization(decoder_output[:, : -self.n_encoder_out])`.

Do you think it is the proper way to do it? Could it have any ill effects? Thanks for the help!
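For concreteness, a rough reconstruction of the modification described above (inferred from the description, not the actual code attached to the issue; the class name CustomCore and the sizes are assumptions):

```python
import torch
from torch import nn


class CustomCore(nn.Module):
    """Core that returns its output concatenated with its input, so that the
    critic downstream sees both the core features and the encoder features."""

    def __init__(self, n_encoder_out: int, hidden: int = 128):
        super().__init__()
        self.n_encoder_out = n_encoder_out
        self.core_head = nn.Sequential(
            nn.Linear(n_encoder_out, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, head_output: torch.Tensor) -> torch.Tensor:
        # Concatenate the core output (the part fed to the action
        # parameterization) with the raw encoder output (the "state" part).
        return torch.cat((self.core_head(head_output), head_output), dim=1)


# Downstream, the critic consumes the full concatenated vector, while the action
# parameterization only receives the leading part, mirroring the slicing in the
# question, e.g. decoder_output[:, : -n_encoder_out].
```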
For reference, here's the network summary printed in the terminal. The state is of size 25, the action is of size 6.