
[Question] Changing observation space during training #1157

Closed
4 tasks done
pengzhi1998 opened this issue Nov 4, 2022 · 18 comments
Labels
question Further information is requested

Comments

@pengzhi1998

❓ Question

I have a question regarding changing the observation space during training.

I'm using an attention block to deal with a multi-agent task. While attention lets me easily vary the number of agents, it seems Stable-Baselines3 itself reports a dimension error when the number of landmarks (and therefore the observation space) changes on reset. In this case, may I have your suggestions on how to achieve this? Thank you!

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • If code there is, it is minimal and working
  • If code there is, it is formatted using the markdown code blocks for both code and stack traces.
pengzhi1998 added the question Further information is requested label Nov 4, 2022
@qgallouedec
Collaborator

Related: #1077 (comment)

@qgallouedec
Collaborator

qgallouedec commented Nov 4, 2022

Following #1077 (comment), I would suggest using a constant observation space size (equal to the largest possible observation). To do this, pad the inner observation (the one that varies in size) with zeros (or whatever values) to obtain a constant-size outer observation (the one returned by step). You can also return the associated mask in the info dict. This way you stick to the paradigm of a gym environment, whose observation and action spaces should not change.
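To make this concrete, here is a minimal sketch of the padding idea (the env, MAX_LANDMARKS, FEATURES, and the random "raw" observations are hypothetical placeholders for illustration, not code from this thread):

```python
import gym
import numpy as np
from gym import spaces


class PaddedObsEnv(gym.Env):
    """Toy env: pads a variable number of landmarks to a fixed-size observation."""

    MAX_LANDMARKS = 10  # largest possible number of landmarks
    FEATURES = 4        # features per landmark (assumed for the example)

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(self.MAX_LANDMARKS * self.FEATURES,), dtype=np.float32,
        )
        self.action_space = spaces.Discrete(2)  # dummy action space
        self.n_landmarks = self.MAX_LANDMARKS

    def _pad(self, raw_obs):
        # raw_obs: (n_landmarks, FEATURES) with n_landmarks <= MAX_LANDMARKS
        padded = np.zeros((self.MAX_LANDMARKS, self.FEATURES), dtype=np.float32)
        padded[: raw_obs.shape[0]] = raw_obs
        mask = np.zeros(self.MAX_LANDMARKS, dtype=bool)
        mask[: raw_obs.shape[0]] = True  # True marks real (non-padding) entries
        return padded.ravel(), mask

    def reset(self):
        # the number of landmarks may change from one episode to the next
        self.n_landmarks = np.random.randint(1, self.MAX_LANDMARKS + 1)
        obs, _ = self._pad(np.random.randn(self.n_landmarks, self.FEATURES))
        return obs

    def step(self, action):
        obs, mask = self._pad(np.random.randn(self.n_landmarks, self.FEATURES))
        return obs, 0.0, False, {"mask": mask}  # mask travels through the info dict
```

The observation returned by step always has the same shape, so the declared observation_space never changes; only the mask in info tells the policy which entries are real.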

@pengzhi1998
Author

Thank you so much for this quick reply and your help!

But for example, if there are 8 agents while the maximum number is 10, do you suggest that the observations for the last two agents be all zeros? But the zeros would also be fed into the network in this case, which would affect training and testing.

Do you mean to make use of the info dict to deal with this problem?

@qgallouedec
Collaborator

qgallouedec commented Nov 4, 2022

But for example, if there are 8 agents while the maximum number is 10, do you suggest that the observations for the last two agents be all zeros?

Yes.

But the zeros would also be fed into the network in this case, which would affect training and testing.

It won't if you mask it properly; see https://ai.stackexchange.com/questions/22957/how-can-transformers-handle-arbitrary-length-input

Do you mean to make use of the info dict to deal with this problem?

The mask could be returned with the info dict, yes.
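For reference, this is roughly what the masking looks like with a standard attention layer; the tensors and sizes below are made up for illustration and assume PyTorch's nn.MultiheadAttention:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# batch of 2 observations, padded to 10 landmark tokens each
x = torch.randn(2, 10, embed_dim)

# validity mask: first sample has 8 real landmarks, second has all 10
valid = torch.zeros(2, 10, dtype=torch.bool)
valid[0, :8] = True
valid[1, :] = True

# key_padding_mask expects True where a position should be IGNORED,
# so the inverse of the validity mask is passed
out, _ = attn(x, x, x, key_padding_mask=~valid)
```

With the padded positions masked out, the zero entries never influence the attention weights of the real tokens.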

@pengzhi1998
Author

pengzhi1998 commented Nov 4, 2022

Thank you! But I'm still a little confused, since the training loop is wrapped by the Stable-Baselines3 framework. If the mask is returned in the info dict by env.step, where could I use that information to modify the actual observation fed into the network during training?

I'm sorry to keep bothering you. Thank you again for your great help!

@qgallouedec
Collaborator

qgallouedec commented Nov 4, 2022

Indeed, this will not work as is in SB3. You have to create your own feature extractor.

Thinking about it, I advise you to use a gym.spaces.Dict as the observation space, with an "observation" key and a "mask" key (instead of using the info dict).
Then create your own feature extractor (see the documentation) in which you apply the masking we talked about.

And I think that's all you have to do. That's the easiest way in my opinion.
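A rough sketch of what this could look like (the space layout, the MaskedAttentionExtractor name, and all sizes are assumptions for illustration, not an official SB3 recipe):

```python
import gym
import torch
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

# Assumed observation space:
# spaces.Dict({
#     "observation": spaces.Box(low=-inf, high=inf, shape=(MAX_AGENTS, FEATURES)),
#     "mask": spaces.MultiBinary(MAX_AGENTS),
# })


class MaskedAttentionExtractor(BaseFeaturesExtractor):
    """Applies the padding mask inside an attention block before pooling."""

    def __init__(self, observation_space: gym.spaces.Dict, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        n_features = observation_space["observation"].shape[-1]
        self.embed = nn.Linear(n_features, features_dim)
        self.attn = nn.MultiheadAttention(features_dim, num_heads=4, batch_first=True)

    def forward(self, observations):
        obs = observations["observation"]       # (batch, MAX_AGENTS, FEATURES)
        mask = observations["mask"].bool()      # (batch, MAX_AGENTS), True = real entry
        x = self.embed(obs)
        x, _ = self.attn(x, x, x, key_padding_mask=~mask)
        # average only over the real (unmasked) tokens
        mask_f = mask.unsqueeze(-1).float()
        return (x * mask_f).sum(dim=1) / mask_f.sum(dim=1).clamp(min=1.0)
```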

@qgallouedec
Collaborator

If you manage to make this work, please share it here, it may help other people.

@pengzhi1998
Author

Got it. And thank you so much!

@pengzhi1998
Author

pengzhi1998 commented Nov 5, 2022

I had been using the custom network (CustomNetwork) before and it worked well. I just tried gym.spaces.Dict with the custom network, but it reports the error AttributeError: 'dict' object has no attribute 'flatten'.

Besides modifying the observation in the gym env, I have also changed the observation_space to gym.spaces.Dict when defining the policy, but the error is still there. What other variables should I change?

I found an existing issue, but it is not helpful because the problem there could be solved by just using a MultiInputPolicy, while I need to use my custom policy instead of building on MultiInputPolicy.

@qgallouedec
Collaborator

As I mentioned before, I recommend that you use a custom feature extractor (instead of a custom network), as your need does not seem to require this level of customization.
On the other hand, you should work with Multiple Inputs and Dictionary Observations (see the documentation). Let me know if it works better.
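As a hedged illustration of the wiring (again assuming the hypothetical MaskedAttentionExtractor sketched above and an env that returns the Dict observation), the custom extractor is passed through policy_kwargs rather than replacing the whole policy:

```python
from stable_baselines3 import PPO

model = PPO(
    "MultiInputPolicy",          # handles gym.spaces.Dict observations
    env,
    policy_kwargs=dict(
        features_extractor_class=MaskedAttentionExtractor,
        features_extractor_kwargs=dict(features_dim=64),
    ),
    verbose=1,
)
model.learn(total_timesteps=10_000)
```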

@pengzhi1998
Author

pengzhi1998 commented Nov 5, 2022

Thank you so much again for this quick reply.

But in my case, it's better to use the custom network with an attention block (not a custom feature extractor that builds the layers sequentially). I'll first try the Multiple Inputs and Dictionary Observations approach you suggested and see how it works.

But can these two be combined with each other?

@qgallouedec
Collaborator

as your need does not seem to require this level of customization

In fact I realize that this sentence is not clear at all. Let me explain it better:
As explained in Policy, the agent first extracts the features (with the feature extractor), and then uses a fully connected network to output the value and action.
In your case, you need to perform the masking at the input of the feature extractor. And you don't need to modify the fully connected network (unless you have other specific needs, but that doesn't belong in this thread).

@pengzhi1998
Author

Oh, I got this.

Let me double-check: do you mean I first use a feature extractor to get the wanted observation based on the mask from the dictionary, then I input the masked observation to my custom policy network without any modification to obtain the policy and value?

@pengzhi1998
Author

pengzhi1998 commented Nov 5, 2022

But is it possible to use both at the same time: the feature extractor only for dealing with the observation, and the custom policy network to compute the policy and value?

Maybe this would work?
model = PPO(CustomActorCriticPolicy, env, verbose=1, n_steps=2048, seed=args.seed, policy_kwargs=observation_extractor), with observation_extractor as the feature extractor and CustomActorCriticPolicy as my custom PPO network?

@qgallouedec
Collaborator

Let me double-check: do you mean I first use a feature extractor to get the wanted observation based on the mask from the dictionary,

Yes.

then I input the masked observation to my custom policy network without any modification to obtain the policy and value?

No.
The masked observation is fed into a network within the feature extractor. Please read the documentation about custom feature extractors thoroughly.

@pengzhi1998
Author

Thank you! Actually, the custom policy network is needed in my task (I need it to build an attention block in a non-sequential manner; I asked about this before here), so I would keep the custom policy unchanged.

I just tried this command to vary the number of agents at each episode:
model = PPO(CustomActorCriticPolicy, env, verbose=1, n_steps=2048, seed=args.seed, policy_kwargs=observation_extractor), with observation_extractor as the feature extractor that only masks the variable-length observations (it has no parameters) and CustomActorCriticPolicy as my custom PPO network.
In this case, I don't need to modify much and can build on the previous code. It can work, but there are two problems with it.

  1. When using the saved model for testing, an error is reported: ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape. I'm still looking into it, but could this be a problem caused by model saving? I think I have saved both the feature extractor (with no parameters) and the custom policy network.

  2. Another problem is that the collected observations with different lengths can't be stacked into a mini-batch tensor for training. For example, observation tensors with shapes [1, 32], [1, 30], [1, 44], etc. can't form a minibatch for back-propagation because the sizes along dimension 1 differ. As you suggested, I think the mask and padding would be helpful (thank you so much for this link: https://ai.stackexchange.com/questions/22957/how-can-transformers-handle-arbitrary-length-input). But should I implement this on the network side?

May I have your advice? I'm sorry for the inconvenience, and I really appreciate your great help!

@qgallouedec
Collaborator

I think I'm at the end of what I can advise you, both in terms of knowledge and the time I can devote to it. Also, I think we're getting off track with this issue.
I don't think that this ValueError comes from saving your model. It's probably an implementation error in your model. If you think it's an SB3 bug, submit a new issue and provide minimal working code so we can work on it.

Another problem is that the collected observations with different lengths can't be stacked into a mini-batch tensor for training. For example, observation tensors with shapes [1, 32], [1, 30], [1, 44]

I think you haven't fully understood masking. All observation tensors must have the same size, and each is associated with a mask (this is where its "intrinsic length" is encoded). So there are never observation tensors with varying shapes. I advise you to read up on the subject, and possibly to get help by asking your question on the Discord if you can't get it working.
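To illustrate the point with the shapes mentioned above, here is a tiny, self-contained sketch (the maximum length of 44 is simply taken from the example sizes):

```python
import numpy as np

MAX_LEN = 44  # largest possible observation length in this example
raw_lengths = [32, 30, 44]

padded, masks = [], []
for n in raw_lengths:
    raw = np.random.randn(n).astype(np.float32)  # variable-length raw observation
    obs = np.zeros(MAX_LEN, dtype=np.float32)
    obs[:n] = raw                                 # zero-padding to the common size
    mask = np.zeros(MAX_LEN, dtype=bool)
    mask[:n] = True                               # True marks the real entries
    padded.append(obs)
    masks.append(mask)

batch_obs = np.stack(padded)   # (3, 44): stacks fine because all shapes match
batch_mask = np.stack(masks)   # (3, 44): the masks carry the original lengths
```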

@pengzhi1998
Author

Got it. Really appreciate your help!
