
Creative action space support: contains method, action interpolation. #4837

Open
ragavvenkatesan opened this issue May 22, 2019 · 6 comments
Labels
P3 Issue moderate in impact or severity

Comments

@ragavvenkatesan

Description of the problem

There are two ways of creatively engineering action spaces:

  1. OpenAI Gym action spaces have the following methods that agents should use internally: sample and contains. RL agents should adhere to these methods. For instance, if the RL agent only produces actions that satisfy the contains method, then any complicated action space can easily be defined through contains.

  2. In parametric environments, along with action masking, there should be a way to report back to the agent which action was actually used. If the agent provides an invalid action and the step method decodes, translates, or interpolates it into a valid one, the step method should return the action actually used, which the agent then consumes for backpropagation. By default, this would be the same action the agent rolled out.
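A minimal sketch of the second idea, assuming the environment reports the action it actually applied in the `info` dict under a hypothetical `"used_action"` key (this key is not part of Gym or RLlib). `SnapToGridEnv` and `rollout` are illustrative names: the rollout loop stores the reported action for learning and falls back to the sampled action when the key is absent.

```python
class SnapToGridEnv:
    """Hypothetical toy env: accepts any float action but only applies
    multiples of 0.25, reporting the applied action in info."""

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        used = round(action * 4) / 4  # translate to the nearest valid action
        self.t += 1
        reward = -abs(used - 0.5)
        done = self.t >= 3
        return used, reward, done, {"used_action": used}


def rollout(env, policy):
    """Collect (obs, action, reward) tuples, recording the action the env
    actually used rather than the one the policy proposed."""
    batch = []
    obs = env.reset()
    done = False
    while not done:
        proposed = policy(obs)
        next_obs, reward, done, info = env.step(proposed)
        # Fall back to the rolled-out action if the env does not report one.
        used = info.get("used_action", proposed)
        batch.append((obs, used, reward))
        obs = next_obs
    return batch


batch = rollout(SnapToGridEnv(), policy=lambda obs: 0.3)
print([a for _, a, _ in batch])  # [0.25, 0.25, 0.25]
```

The policy proposed 0.3 each step, but the batch that would feed backpropagation holds 0.25, the action the environment really executed.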

Either of these facilities would greatly expand our ability to creatively engineer complex action spaces, including for Safe RL techniques.

@ericl
Contributor

ericl commented May 24, 2019

> OpenAI Gym action spaces have the following methods that agents should use internally: sample and contains. RL agents should adhere to these methods. For instance, if the RL agent only produces actions that satisfy the contains method, then any complicated action space can easily be defined through contains.

IIUC, the proposal is to use .contains() as a boolean check of whether an action is valid and, if it is not, to sample a new action? One issue I can see is that sampling could be very slow if the containment constraint is restrictive.
By the way, for any built-in action space we already clip actions to the space so that contains() should always be true (clip_actions=True by default).
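To make the cost concrete, here is a sketch of rejection sampling against a contains-style predicate, written in plain Python rather than Gym; `sample_until_valid`, `max_tries`, and the two predicates are hypothetical names. If the predicate accepts a fraction p of the sampling region, the expected number of draws is 1/p, so a restrictive predicate makes sampling slow or, with a cap, lets it fail.

```python
import random


def sample_until_valid(sample, contains, max_tries=1000):
    """Rejection sampling: draw from `sample` until `contains` accepts.

    Returns (action, number_of_draws). Raises RuntimeError if nothing is
    accepted within `max_tries` draws.
    """
    for tries in range(1, max_tries + 1):
        action = sample()
        if contains(action):
            return action, tries
    raise RuntimeError(f"no valid action after {max_tries} draws")


rng = random.Random(0)


def uniform01():
    return rng.uniform(0.0, 1.0)


def loose(a):
    # Accepts ~90% of [0, 1]: about 1.1 draws on average.
    return not (0.5 < a < 0.6)


def tight(a):
    # Accepts ~0.1% of [0, 1]: about 1000 draws on average.
    return 0.5 < a < 0.501
```

With `loose` a valid action almost always appears on the first draw; with `tight` the default cap of 1000 draws fails a large fraction of the time, which is the slowdown described above.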

> In parametric environments, along with action masking, there should be a way to report back to the agent which action was actually used. If the agent provides an invalid action and the step method decodes, translates, or interpolates it into a valid one, the step method should return the action actually used, which the agent then consumes for backpropagation. By default, this would be the same action the agent rolled out.

This sounds pretty useful. Perhaps it could be an extra method of the environment, preprocess_action(), which can return a modified action. If the action is modified, then we use that one for learning instead of the sampled one.
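A sketch of how a sampler could consult such a hook. `preprocess_action` is only the method name floated in this comment, not an existing Gym or RLlib API, and `ClampNegativeEnv` and `step_with_preprocessing` are hypothetical: if the env defines the hook, the possibly modified action is stepped and returned as the action to learn from.

```python
class ClampNegativeEnv:
    """Toy env implementing the proposed (hypothetical) hook."""

    def preprocess_action(self, action):
        return max(action, 0.0)  # negative actions are not allowed

    def step(self, action):
        # obs, reward, done, info; reward just echoes the applied action.
        return 0.0, float(action), True, {}


def step_with_preprocessing(env, sampled_action):
    """Step the env and return (transition, action_to_learn_from)."""
    hook = getattr(env, "preprocess_action", None)
    action = hook(sampled_action) if hook is not None else sampled_action
    transition = env.step(action)
    # If the hook changed the action, learn from the modified one.
    return transition, action
```

Envs that do not define the hook are stepped with the sampled action unchanged.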

@ragavvenkatesan
Author

I don't know what 'IIUC' stands for.
Even though built-in action spaces clip actions, clipping is still different from the contains method. Consider a space that is nominally a Box over (0, 1), but whose contains method returns false for any action in (0.5, 0.6). Such a space cannot be implemented with a simple Box.
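The example above can be written down directly. This is a plain-Python stand-in for a Gym space, and `HoleyBox` is a hypothetical name, not a Gym class: it samples and clips like a Box over [0, 1], but its `contains` rejects the open interval (0.5, 0.6). Clipping repairs an out-of-bounds action such as 1.7, yet leaves 0.55 unchanged and therefore still invalid.

```python
import random


class HoleyBox:
    """Scalar interval [low, high] with an excluded open hole (hole_lo, hole_hi)."""

    def __init__(self, low=0.0, high=1.0, hole=(0.5, 0.6), seed=None):
        self.low, self.high = low, high
        self.hole_lo, self.hole_hi = hole
        self._rng = random.Random(seed)

    def sample(self):
        # Naive Box-style sampling: may land in the hole.
        return self._rng.uniform(self.low, self.high)

    def clip(self, action):
        # What Box-style clipping does: only enforces the outer bounds.
        return min(max(action, self.low), self.high)

    def contains(self, action):
        in_bounds = self.low <= action <= self.high
        in_hole = self.hole_lo < action < self.hole_hi
        return in_bounds and not in_hole


space = HoleyBox()
print(space.contains(1.7), space.contains(space.clip(1.7)))    # False True
print(space.contains(0.55), space.contains(space.clip(0.55)))  # False False
```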

I suggest implementing the second method not on the environment but on the action space class. The environment should ideally be agnostic of how actions are handled. The way I picture this is a decode method on the action class: just as the contains method consumes an action and returns a bool, the decode method would consume an action and return another action. By default, decode would simply return its input unchanged, but it could also be used as an interpolator, for instance to solve Safe RL problems.

Internally, RLlib can then use the action space's decode method to infer which action was actually used and backpropagate through that action only.
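A sketch of the decode idea in plain Python with no Gym dependency; `DecodableSpace` and `SafeIntervalSpace` are hypothetical names. The base `decode` is the identity. The subclass acts as a simple projection, one possible form of the interpolation described here: it clips to [0, 1] and snaps any action inside the forbidden hole (0.5, 0.6) to the nearer edge, so every decoded action satisfies `contains`.

```python
class DecodableSpace:
    """Base space: decode returns the action unchanged."""

    def contains(self, action):
        return True

    def decode(self, action):
        return action


class SafeIntervalSpace(DecodableSpace):
    """[0, 1] with a forbidden open hole (0.5, 0.6); decode projects onto
    the nearest allowed value."""

    low, high, hole_lo, hole_hi = 0.0, 1.0, 0.5, 0.6

    def contains(self, action):
        return self.low <= action <= self.high and not (
            self.hole_lo < action < self.hole_hi
        )

    def decode(self, action):
        action = min(max(action, self.low), self.high)
        if self.hole_lo < action < self.hole_hi:
            # Snap to whichever edge of the hole is closer.
            mid = (self.hole_lo + self.hole_hi) / 2
            action = self.hole_lo if action < mid else self.hole_hi
        return action
```

Under this design the sampler would step the env with `decode(action)` and store that same value as the action used for learning.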

@ragavvenkatesan
Author

I am willing to contribute these features. I am new to the library, so I am still getting the hang of it. If you can give me some insight into where this would fit design-wise, and if we agree on the design itself, I am happy to implement it myself.

I need this anyway for the research I am doing now, so... :D

@ericl
Contributor

ericl commented Jun 5, 2019

@ragavvenkatesan it would be great to add this.

> I suggest implementing the second method not on the environment but on the action space class.

Makes sense. How would you do this? Would the user be responsible for subclassing the right Gym action space to add decode?

> By default, decode would simply return its input unchanged, but it could also be used as an interpolator, for instance to solve Safe RL problems.

Sounds good. Btw the decode can probably be done right after where clip_action() is called in sampler.py.

PS: IIUC = if I understand correctly

@ragavvenkatesan
Author

ragavvenkatesan commented Jun 5, 2019

Yes, I suppose a new method, called decode or perhaps `__call__`, would be added to the subclassed action space classes. The problem is that this would not be backward compatible with existing Gym classes unless we check for a method called decode: if it does not exist, behave as before; otherwise, call it.
If I am correct, this logic should go somewhere here.
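The backward-compatible lookup described above could look roughly like this. It is a sketch, not RLlib code: `clip_to_bounds` stands in for RLlib's own clipping step (which the thread places around `clip_action()` in `sampler.py`), and `postprocess_action`, `LegacySpace`, and `HoleSpace` are hypothetical. Spaces without a `decode` method behave exactly as before.

```python
def clip_to_bounds(action, low, high):
    """Stand-in for Box-style clipping of a scalar action."""
    return min(max(action, low), high)


def postprocess_action(space, action, low=0.0, high=1.0):
    """Clip the action, then call space.decode if the space defines one.

    Returns the action to send to env.step() and to store for learning.
    """
    action = clip_to_bounds(action, low, high)
    decode = getattr(space, "decode", None)
    if callable(decode):
        action = decode(action)
    return action


class LegacySpace:
    """An existing space with no decode method."""


class HoleSpace:
    """A space that maps anything in (0.5, 0.6) down to 0.5."""

    def decode(self, action):
        return 0.5 if 0.5 < action < 0.6 else action
```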

Alternatively, for smoother backward compatibility with all Gym classes, the parametric environment could return a key called used_actions alongside avail_actions. This is a little uglier, because the environment has to deal with actions directly and the decoding mechanism is tied to the environment instead of to the action space. I am torn between these two options.
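For comparison, a sketch of the environment-side alternative. It assumes an observation dict in the style of the parametric-actions setup, here with plain action indices under `avail_actions` and a mask under `action_mask`, and adds the proposed `used_actions` entry. The toy env, its field layout, and the remapping rule are all hypothetical. The remapping logic lives in the environment rather than in the action space, which is the drawback mentioned above.

```python
class ParametricToyEnv:
    """Discrete env with 4 actions; action 2 is masked out and gets
    remapped to action 1 if the agent picks it anyway."""

    NUM_ACTIONS = 4

    def __init__(self):
        self.mask = [1, 1, 0, 1]

    def _obs(self, used_action):
        return {
            "avail_actions": list(range(self.NUM_ACTIONS)),
            "action_mask": list(self.mask),
            # Proposed extra key: the action the env actually applied.
            "used_actions": used_action,
        }

    def reset(self):
        return self._obs(used_action=None)

    def step(self, action):
        used = action if self.mask[action] else 1  # remap invalid choices
        reward = 1.0 if used == 3 else 0.0
        return self._obs(used_action=used), reward, True, {}
```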

@ericl
Copy link
Contributor

ericl commented Jun 5, 2019 via email

ericl added the P3 Issue moderate in impact or severity label Mar 7, 2020