[Question] Why does MaskablePPO does not mask with some logic with last observation? #234

EloyAnguiano · 2024-03-21T08:18:47Z

❓ Question

In the MaskablePPO class, the masks are obtained by asking the environment for them through the function get_action_masks. I can see that get_action_masks only takes the environment object as input, but at that point we also have the self._last_obs variable. To give the action mask more information about the observation the agent is facing, it would be interesting to pass the last observation to that method, wouldn't it? I am thinking of a game whose rules we want to keep, with code that prevents the agent from taking the actions those rules forbid.

I assume I am not the first to think of this, so: would doing it this way hurt performance? Does it have something to do with environment vectorization?

Checklist

I have checked that there is no similar issue in the repo
I have read the documentation
If code there is, it is minimal and working
If code there is, it is formatted using the markdown code blocks for both code and stack traces.


araffin · 2024-03-31T10:56:29Z

It would be interesting to pass the last observation to that method, wouldn't it? I am thinking of a game whose rules we want to keep, with code that prevents the agent from taking the actions those rules forbid.

You can add that logic in the environment code, no? (The action mask may depend on the previous observation or on any other variable that represents the current state of the env.)

stable-baselines3-contrib/sb3_contrib/common/maskable/evaluation.py

Line 94 in 667a789

action_masks = get_action_masks(env)
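
As a minimal sketch of that suggestion (not code from this thread), the environment can derive the mask from the same internal state it uses to build the observation. The `CounterEnv` class, its fields, and its masking rule are hypothetical. The method name `action_masks` is assumed to be the one that `get_action_masks(env)` looks up on the environment. The sketch uses only the standard library, not Gymnasium or sb3_contrib.

```python
from typing import List, Tuple


class CounterEnv:
    """Toy env: a counter in [0, limit]; action 0 decrements, action 1 increments."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self.value = 0  # current state, also returned as the observation

    def reset(self) -> int:
        self.value = 0
        return self.value

    def step(self, action: int) -> Tuple[int, float, bool]:
        # The state is updated here, so a later action_masks() call sees it.
        self.value += 1 if action == 1 else -1
        done = self.value == self.limit
        return self.value, float(done), done

    def action_masks(self) -> List[bool]:
        # True means the action is allowed in the current state.
        # Decrementing is forbidden at 0, incrementing is forbidden at the limit.
        return [self.value > 0, self.value < self.limit]


env = CounterEnv()
obs = env.reset()
print(obs, env.action_masks())
```

Because the mask is read from the environment's own state, no observation needs to be passed in from the algorithm side.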

EloyAnguiano · 2024-04-03T11:45:40Z

Yes, and I am indeed doing that, but the problem with this order of things is that you have to calculate the action mask for time t using the observation at t-1. Changing the order would make it possible to code some logic for the mask at t using the observation at t (even in the t=0 case).

araffin · 2024-04-03T11:53:40Z

this order could be useful to code some logic at mask t with the observation at t (even at the t=0 case)

This is what is currently done, no? (the action mask depends on obs at t)
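
The ordering described here can be sketched with a simplified loop: the mask is requested after the observation for step t exists and before the action for step t is chosen. This is a stand-in written for illustration, not the sb3_contrib rollout code. `LineEnv` and `rollout` are hypothetical names, and a random choice among allowed actions replaces the policy.

```python
import random
from typing import List, Tuple


class LineEnv:
    """Toy env: an agent on cells 0..size-1; action 0 moves left, 1 moves right."""

    def __init__(self, size: int = 4) -> None:
        self.size = size
        self.pos = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, bool]:
        self.pos += 1 if action == 1 else -1
        return self.pos, self.pos == self.size - 1

    def action_masks(self) -> List[bool]:
        # Allowed moves for the current position (the state behind the latest obs).
        return [self.pos > 0, self.pos < self.size - 1]


def rollout(env: LineEnv, max_steps: int = 20, seed: int = 0) -> List[Tuple[int, List[bool], int]]:
    rng = random.Random(seed)
    log = []
    obs = env.reset()  # observation at t = 0
    for _ in range(max_steps):
        masks = env.action_masks()  # mask for step t, from the state that produced obs at t
        allowed = [a for a, ok in enumerate(masks) if ok]
        action = rng.choice(allowed)  # stand-in for the policy
        log.append((obs, masks, action))
        obs, done = env.step(action)  # produces the observation at t + 1
        if done:
            break
    return log


for obs, masks, action in rollout(LineEnv()):
    print(obs, masks, action)
```

Each logged triple pairs an observation with the mask computed from that same state, including the first one at t = 0.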

EloyAnguiano · 2024-04-05T07:49:42Z

Yes, you are right. It was a mistake in my environment code.

EloyAnguiano added the question Further information is requested label Mar 21, 2024
