
[Question] Changing observation space during training #1157

Closed
4 tasks done
pengzhi1998 opened this issue Nov 4, 2022 · 18 comments
Labels
question Further information is requested

Comments

@pengzhi1998

❓ Question

I have a question regarding changing the observation space during training.

I'm using an attention block to deal with a multi-agent task. While attention lets me easily vary the number of agents, it seems Stable-Baselines3 itself reports a dimension error when the number of landmarks (and therefore the observation space) changes on reset. In this case, may I have your suggestions on how to achieve this? Thank you!

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • If code there is, it is minimal and working
  • If code there is, it is formatted using the markdown code blocks for both code and stack traces.
pengzhi1998 added the question Further information is requested label Nov 4, 2022
@qgallouedec
Collaborator

Related: #1077 (comment)

@qgallouedec
Collaborator

qgallouedec commented Nov 4, 2022

Following #1077 (comment), I would suggest using a constant observation space size (equal to the largest possible observation). To do this, pad the inner observation (the one that varies in size) with zeros (or whatever values) to obtain a constant-size outer observation (the one returned by step). You can also return the associated mask in the info dict. This way you stick to the paradigm of a gym environment, whose observation and action spaces should not change.
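To make this concrete, here is a minimal sketch of the padding idea (the env, MAX_LANDMARKS, FEATURES, and the random "raw" observations are hypothetical placeholders for illustration, not code from this thread):

```python
import gym
import numpy as np
from gym import spaces


class PaddedObsEnv(gym.Env):
    """Toy env: pads a variable number of landmarks to a fixed-size observation."""

    MAX_LANDMARKS = 10  # largest possible number of landmarks
    FEATURES = 4        # features per landmark (assumed for the example)

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(self.MAX_LANDMARKS * self.FEATURES,), dtype=np.float32,
        )
        self.action_space = spaces.Discrete(2)  # dummy action space
        self.n_landmarks = self.MAX_LANDMARKS

    def _pad(self, raw_obs):
        # raw_obs: (n_landmarks, FEATURES) with n_landmarks <= MAX_LANDMARKS
        padded = np.zeros((self.MAX_LANDMARKS, self.FEATURES), dtype=np.float32)
        padded[: raw_obs.shape[0]] = raw_obs
        mask = np.zeros(self.MAX_LANDMARKS, dtype=bool)
        mask[: raw_obs.shape[0]] = True  # True marks real (non-padding) entries
        return padded.ravel(), mask

    def reset(self):
        # the number of landmarks may change from one episode to the next
        self.n_landmarks = np.random.randint(1, self.MAX_LANDMARKS + 1)
        obs, _ = self._pad(np.random.randn(self.n_landmarks, self.FEATURES))
        return obs

    def step(self, action):
        obs, mask = self._pad(np.random.randn(self.n_landmarks, self.FEATURES))
        return obs, 0.0, False, {"mask": mask}  # mask travels through the info dict
```

The observation returned by step always has the same shape, so the declared observation_space never changes; only the mask in info tells the policy which entries are real.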

@pengzhi1998
Author

Thank you so much for this quick reply and your help!

But for example, if there are 8 agents while the maximum number is 10, do you suggest that the observations for the last two agents be all zeros? But the zeros would also be fed into the network in this case, which would affect training and testing.

Do you mean to make use of the info dict to deal with this problem?

@qgallouedec
Collaborator

qgallouedec commented Nov 4, 2022

But for example, if there are 8 agents while the maximum number is 10, do you suggest that the observations for the last two agents be all zeros?

Yes.

But the zeros would also be fed into the network in this case, which would affect training and testing.

It won't if you mask it properly; see https://ai.stackexchange.com/questions/22957/how-can-transformers-handle-arbitrary-length-input

Do you mean to make use of the info dict to deal with this problem?

The mask could be returned with the info dict, yes.
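For reference, this is roughly what the masking looks like with a standard attention layer; the tensors and sizes below are made up for illustration and assume PyTorch's nn.MultiheadAttention:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 16, 4
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# batch of 2 observations, padded to 10 landmark tokens each
x = torch.randn(2, 10, embed_dim)

# validity mask: first sample has 8 real landmarks, second has all 10
valid = torch.zeros(2, 10, dtype=torch.bool)
valid[0, :8] = True
valid[1, :] = True

# key_padding_mask expects True where a position should be IGNORED,
# so the inverse of the validity mask is passed
out, _ = attn(x, x, x, key_padding_mask=~valid)
```

With the padded positions masked out, the zero entries never influence the attention weights of the real tokens.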

@pengzhi1998
Author

pengzhi1998 commented Nov 4, 2022

Thank you! But I'm still a little confused, since the training loop is wrapped by the Stable-Baselines3 framework. If the mask is returned in the info dict by env.step, where could I use that information to modify the actual observation fed into the network during training?

I'm sorry to keep bothering you. Thank you again for your great help!

@qgallouedec
Collaborator

qgallouedec commented Nov 4, 2022

Indeed, this will not work as is in SB3. You have to create your own feature extractor.

Thinking about it, I advise you to use a gym.spaces.Dict as the observation space, with an "observation" key and a "mask" key (instead of using the info dict).
Then create your own feature extractor (see the documentation) in which you apply the masking we talked about.

And I think that's all you have to do. That's the easiest way in my opinion.
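A rough sketch of what this could look like (the space layout, the MaskedAttentionExtractor name, and all sizes are assumptions for illustration, not an official SB3 recipe):

```python
import gym
import torch
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

# Assumed observation space:
# spaces.Dict({
#     "observation": spaces.Box(low=-inf, high=inf, shape=(MAX_AGENTS, FEATURES)),
#     "mask": spaces.MultiBinary(MAX_AGENTS),
# })


class MaskedAttentionExtractor(BaseFeaturesExtractor):
    """Applies the padding mask inside an attention block before pooling."""

    def __init__(self, observation_space: gym.spaces.Dict, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        n_features = observation_space["observation"].shape[-1]
        self.embed = nn.Linear(n_features, features_dim)
        self.attn = nn.MultiheadAttention(features_dim, num_heads=4, batch_first=True)

    def forward(self, observations):
        obs = observations["observation"]       # (batch, MAX_AGENTS, FEATURES)
        mask = observations["mask"].bool()      # (batch, MAX_AGENTS), True = real entry
        x = self.embed(obs)
        x, _ = self.attn(x, x, x, key_padding_mask=~mask)
        # average only over the real (unmasked) tokens
        mask_f = mask.unsqueeze(-1).float()
        return (x * mask_f).sum(dim=1) / mask_f.sum(dim=1).clamp(min=1.0)
```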

@qgallouedec
Collaborator

If you manage to make this work, please share it here, it may help other people.

@pengzhi1998
Author

Got it. And thank you so much!

@pengzhi1998
Author

pengzhi1998 commented Nov 5, 2022

I had been using the custom network (CustomNetwork) before and it worked well. I just tried gym.spaces.Dict with the custom network, but it reports the error AttributeError: 'dict' object has no attribute 'flatten'.

Besides modifying the observation in the gym env, I have also changed the observation_space to gym.spaces.Dict when defining the policy, but the error is still there. What other variables should I change?

I found an existing issue, but it is not helpful because the problem there could be solved by just using a MultiInputPolicy, while I need to use my custom policy instead of building on MultiInputPolicy.

@qgallouedec
Collaborator

As I mentioned before, I recommend that you use a custom feature extractor (instead of a custom network), as your need does not seem to require this level of customization.
On the other hand, you should work with Multiple Inputs and Dictionary Observations (see the documentation). Let me know if it works better.
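As a hedged illustration of the wiring (again assuming the hypothetical MaskedAttentionExtractor sketched above and an env that returns the Dict observation), the custom extractor is passed through policy_kwargs rather than replacing the whole policy:

```python
from stable_baselines3 import PPO

model = PPO(
    "MultiInputPolicy",          # handles gym.spaces.Dict observations
    env,
    policy_kwargs=dict(
        features_extractor_class=MaskedAttentionExtractor,
        features_extractor_kwargs=dict(features_dim=64),
    ),
    verbose=1,
)
model.learn(total_timesteps=10_000)
```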

@pengzhi1998
Author

pengzhi1998 commented Nov 5, 2022

Thank you so much again for this quick reply.

But in my case, it's better to use the custom network with an attention block (not a custom feature extractor that builds the layers sequentially). I'll first try the Multiple Inputs and Dictionary Observations approach you suggested and see how it works.

But can these two be combined with each other?

@qgallouedec
Collaborator

as your need does not seem to require this level of customization

In fact I realize that this sentence is not clear at all. Let me explain it better:
As explained in Policy, the agent first extracts the features (with the feature extractor), and then uses a fully connected network to output the value and action.
In your case, you need to perform the masking at the input of the feature extractor. And you don't need to modify the fully connected network (unless you have other specific needs, but that doesn't belong in this thread).

@pengzhi1998
Author

Oh, I got this.

Let me double-check: do you mean I first use a feature extractor to get the wanted observation based on the mask from the dictionary, then I input the masked observation to my custom policy network without any modification to obtain the policy and value?

@pengzhi1998
Author

pengzhi1998 commented Nov 5, 2022

But is it possible to use both at the same time: the feature extractor only for dealing with the observation, and the custom policy network to compute the policy and value?

Maybe this would work?
model = PPO(CustomActorCriticPolicy, env, verbose=1, n_steps=2048, seed=args.seed, policy_kwargs=observation_extractor), with observation_extractor as the feature extractor and CustomActorCriticPolicy as my custom PPO network?

@qgallouedec
Collaborator

Let me double-check: do you mean I first use a feature extractor to get the wanted observation based on the mask from the dictionary,

Yes.

then I input the masked observation to my custom policy network without any modification to obtain the policy and value?

No.
The masked observation is fed into a network within the feature extractor. Please read the documentation about custom feature extractors thoroughly.

@pengzhi1998
Author

Thank you! Actually, the custom policy network is needed in my task (I need it to build an attention block in a non-sequential manner; I asked about this before here), so I would keep the custom policy unchanged.

I just tried this command to vary the number of agents at each episode:
model = PPO(CustomActorCriticPolicy, env, verbose=1, n_steps=2048, seed=args.seed, policy_kwargs=observation_extractor), with observation_extractor as the feature extractor that only masks the variable-length observations (it has no parameters) and CustomActorCriticPolicy as my custom PPO network.
In this case, I don't need to modify much and can build on the previous code. It can work, but there are two problems with it.

  1. When using the saved model for testing, an error is reported: ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape. I'm still looking into it, but could this be a problem caused by model saving? I think I have saved both the feature extractor (with no parameters) and the custom policy network.

  2. Another problem is that the collected observations with different lengths can't be stacked into a mini-batch tensor for training. For example, observation tensors with shapes [1, 32], [1, 30], [1, 44], etc. can't form a minibatch for back-propagation because the sizes along dimension 1 differ. As you suggested, I think the mask and padding would be helpful (thank you so much for this link: https://ai.stackexchange.com/questions/22957/how-can-transformers-handle-arbitrary-length-input). But should I implement this on the network side?

May I have your advice? I'm sorry for the inconvenience, and I really appreciate your great help!

@qgallouedec
Collaborator

I think I'm at the end of what I can advise you, both in terms of knowledge and the time I can devote to it. Also, I think we're getting off track with this issue.
I don't think that this ValueError comes from saving your model. It's probably an implementation error in your model. If you think it's an SB3 bug, submit a new issue and provide minimal working code so we can work on it.

Another problem is that the collected observations with different lengths can't be stacked into a mini-batch tensor for training. For example, observation tensors with shapes [1, 32], [1, 30], [1, 44]

I think you haven't fully understood masking. All observation tensors must have the same size, and each is associated with a mask (this is where its "intrinsic length" is encoded). So there are never observation tensors with varying shapes. I advise you to read up on the subject, and possibly to get help by asking your question on the Discord if you can't get it working.
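To illustrate the point with the shapes mentioned above, here is a tiny, self-contained sketch (the maximum length of 44 is simply taken from the example sizes):

```python
import numpy as np

MAX_LEN = 44  # largest possible observation length in this example
raw_lengths = [32, 30, 44]

padded, masks = [], []
for n in raw_lengths:
    raw = np.random.randn(n).astype(np.float32)  # variable-length raw observation
    obs = np.zeros(MAX_LEN, dtype=np.float32)
    obs[:n] = raw                                 # zero-padding to the common size
    mask = np.zeros(MAX_LEN, dtype=bool)
    mask[:n] = True                               # True marks the real entries
    padded.append(obs)
    masks.append(mask)

batch_obs = np.stack(padded)   # (3, 44): stacks fine because all shapes match
batch_mask = np.stack(masks)   # (3, 44): the masks carry the original lengths
```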

@pengzhi1998
Author

Got it. Really appreciate your help!
