
[Question] Pretraining policy using BC and continuing training using SB3 #543

Closed
hectorIzquierdo opened this issue Aug 17, 2021 · 3 comments
Labels
question Further information is requested

Comments

@hectorIzquierdo

hectorIzquierdo commented Aug 17, 2021

Question

Hello,

I'm trying to pretrain a policy using BC and then continue training it with SB3. I read #27 and followed the Colab notebook mentioned there by @araffin: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/pretraining.ipynb. That worked fine. The problem comes when continuing to learn using SB3. While the average episode reward when using the pretrained policy is very high, after training a few more steps with RL it decreases dramatically. It's like pre-training has been of no use at all. My question is: is this the correct approach with SB3 (so it should be working fine and the problem is specific to my case), or is there something I'm missing?

Additional context

This is the code I'm using. The pretrain_agent function is the one used in the Colab notebook I just shared. The observation space is an image (Box, 45x45x1, values 0-255) and the action space is Discrete(3112).

Initialize model

from stable_baselines3 import A2C

# RSEnv is my custom environment
env = RSEnv()
model = A2C('CnnPolicy', env, verbose = 1, n_steps = 1, seed = 1, learning_rate = 0.0007)

Pretrain model

# pretrain_agent is the BC pretraining function from the Colab notebook linked above
pretrain_agent(model, batch_size = 64, epochs = 30, learning_rate = 5.0, test_batch_size = 64)

Continue training

model.learn(10000)
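
For reference, a minimal sketch of how the before/after comparison can be measured, using SB3's evaluate_policy with the model and env from the snippet above (the number of evaluation episodes is arbitrary):

from stable_baselines3.common.evaluation import evaluate_policy

# Average episode reward right after BC pretraining
mean_before, std_before = evaluate_policy(model, env, n_eval_episodes=20)

# Continue with RL, then evaluate again
model.learn(10000)
mean_after, std_after = evaluate_policy(model, env, n_eval_episodes=20)

print(f"before RL: {mean_before:.1f} +/- {std_before:.1f}, after RL: {mean_after:.1f} +/- {std_after:.1f}")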

Checklist

  • I have read the documentation (required)
  • I have checked that there is no similar issue in the repo (required)
hectorIzquierdo added the question (Further information is requested) label on Aug 17, 2021
@Miffyli
Collaborator

Miffyli commented Aug 17, 2021

This goes more into the "exploration" side of things, and we do not have the right answers here. However, it is somewhat well known that RL training after BC will quickly erase the behaviour learned during BC training. Some suggestions:

  • Try a bigger batch size and/or n_steps (more data before the first RL updates)
  • Requires modifications to the code, but try first training only the value head for some time, to initialize it with the correct value estimates of the BC model (something similar is done e.g. in this SC2 paper).
  • Requires modifications to the code, but add a KL-penalty loss which aims to keep the RL model outputs close to the BC model outputs (something similar is done e.g. in this SC2 paper and this competition submission); a rough sketch of this follows the list.
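
As a rough illustration of the KL-penalty idea (this is not SB3's API: bc_policy and rl_policy here are placeholder modules that map observations to action logits, and wiring this term into A2C's loss would require subclassing the algorithm), the penalty can be computed with PyTorch distributions:

import torch
from torch.distributions import Categorical
from torch.distributions.kl import kl_divergence

def kl_penalty(bc_policy, rl_policy, obs, coef=0.1):
    # Action distribution of the frozen BC policy (no gradients flow into it)
    with torch.no_grad():
        bc_dist = Categorical(logits=bc_policy(obs))
    # Action distribution of the current RL policy
    rl_dist = Categorical(logits=rl_policy(obs))
    # Penalize divergence from the BC policy; add this term to the RL loss
    return coef * kl_divergence(bc_dist, rl_dist).mean()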

You could also try asking on e.g. RL Discord.

@araffin
Member

araffin commented Aug 23, 2021

after training a few more steps with RL it decreases dramatically. It's like pre-training has been of no use at all.

I recommend looking into offline RL to understand the issues you are facing (start with the BCQ paper ;)) and probably also reading this paper: https://arxiv.org/abs/2006.09359

araffin closed this as completed on Aug 30, 2021
@hectorIzquierdo
Author

Ok, thank you so much for your help. I'll take a look at it.
