I found a possible bug/unwanted behaviour while training a PPOAgent on TicTacToe with action masking.
In the file agents/ppo/ppo_policy.py, line 237, time_step is first normalized, and the normalized observation is then fed into the _actor_network. However, if the actual ActorDistributionNetwork is wrapped inside a MaskSplitterNetwork, the observation at the point of normalization still contains the mask, so the mask gets normalized as well. Since the mask's dtype is int or bool, the normalization and the subsequent rounding/casting produce wrong masks at the ActorDistributionNetwork.
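To illustrate the effect, here is a toy numpy sketch (not tf_agents code; the running statistics below are made up to stand in for an updated normalizer):

```python
import numpy as np

# A 9-cell TicTacToe legal-move mask as the normalizer would see it after
# the observation nest (including the mask) has been cast to float.
mask = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1], dtype=np.float32)

# A streaming normalizer keeps running mean/std statistics per element;
# after a training step these are no longer (0, 1). Values are hypothetical.
running_mean = np.full_like(mask, 0.6)
running_std = np.full_like(mask, 0.5)

normalized = (mask - running_mean) / running_std
print(normalized)               # [ 0.8  0.8 -1.2  0.8 -1.2 -1.2  0.8  0.8  0.8]

# Casting back to bool to apply the mask no longer recovers {0, 1}: every
# nonzero entry becomes True, so previously illegal moves (-1.2) are now
# treated as legal.
print(normalized.astype(bool))  # [ True  True  True ...  True]
```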
In the case of my TicTacToe environment, the masking was simply wrong as soon as a single training step had been applied and the normalization statistics had been updated.
This should
either be properly documented if this is intended behaviour (namely that masks cannot be used while observation normalization is enabled in the PPOAgent, which it is by default),
or fixed so that the mask is excluded from the normalization (a workaround sketch is given below).
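In the meantime, disabling observation normalization avoids the problem. Here is a minimal sketch of such a setup, assuming the usual MaskSplitterNetwork wiring; the specs, the splitter_fn, and the dict observation layout are illustrative placeholders for a TicTacToe-style environment, not taken from this issue:

```python
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.networks import (actor_distribution_network,
                                mask_splitter_network, value_network)
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Hypothetical TicTacToe specs: a 9-cell board plus a 9-bit legal-move mask.
obs_spec = {
    'observation': tensor_spec.BoundedTensorSpec((9,), tf.float32, 0, 2),
    'mask': tensor_spec.BoundedTensorSpec((9,), tf.int32, 0, 1),
}
action_spec = tensor_spec.BoundedTensorSpec((), tf.int32, 0, 8)
time_step_spec = ts.time_step_spec(obs_spec)

def splitter_fn(observation):
    # Split the full observation into (network input, mask).
    return observation['observation'], observation['mask']

actor_net = mask_splitter_network.MaskSplitterNetwork(
    splitter_fn,
    actor_distribution_network.ActorDistributionNetwork(
        obs_spec['observation'], action_spec),
    passthrough_mask=True,        # forward the mask to the wrapped network
    input_tensor_spec=obs_spec)

value_net = mask_splitter_network.MaskSplitterNetwork(
    splitter_fn,
    value_network.ValueNetwork(obs_spec['observation']),
    input_tensor_spec=obs_spec)   # the value net only needs the board

agent = ppo_agent.PPOAgent(
    time_step_spec,
    action_spec,
    optimizer=tf.keras.optimizers.Adam(1e-3),
    actor_net=actor_net,
    value_net=value_net,
    # Workaround: with normalization on (the default), the mask inside the
    # observation nest is normalized too and ends up corrupted.
    normalize_observations=False)
```

The trade-off is that the board observation is no longer normalized either, which may slow PPO training; the proper fix would be to exclude only the mask from the normalizer inside ppo_policy.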