Add support for pretraining [feature request] #27

skervim · 2020-05-22T08:28:32Z

First: I'm very happy to see the new PyTorch SB3 version! Great job!

My question is whether pretraining-support is planned for SB3 (like for SB: https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html). I couldn't find it being mentioned in the Roadmap.

In my opinion it is a very valuable feature!

araffin · 2020-05-22T08:43:23Z

My question is whether pretraining-support is planned for SB3 (like for SB: https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html). I couldn't find it being mentioned in the Roadmap.

As mentioned in the design choices (see hill-a/stable-baselines#576), everything that is related to imitation learning (it includes GAIL and the pretraining using behavior cloning) will be done outside (certainly in this repo: https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.).

Otherwise, you can check that repo https://github.com/joonaspu/video-game-behavioural-cloning by @Miffyli et al. where pre-training is done using PyTorch.

We may add an example though (and maybe include it in the zoo), as it is simple to implement in some cases.

araffin · 2020-05-22T08:48:52Z

@skervim we would be happy if you could provide such example ;) (maybe as a colab notebook)

Miffyli · 2020-05-22T09:26:00Z

With SB3, I think this should be off-loaded to users indeed. The SB's pretrain function was promising but it was somewhat limiting. With SB3 we could provide interfaces to obtain a policy of right shape given an environment, then user can take this policy and do their own imitation learning (e.g. supervised learning on some dataset of demonstrations), and upload those parameters to policy.

araffin · 2020-05-22T09:32:56Z

With SB3 we could provide interfaces to obtain a policy of right shape given an environment,

This is already the case, no?

Miffyli · 2020-05-22T09:35:37Z

With SB3 we could provide interfaces to obtain a policy of right shape given an environment,

This is already the case, no?

Fair point, it is not hidden per-se, one just needs to know what to access to obtain this policy. An example code of this in the docs should do the trick :)

skervim · 2020-05-22T10:00:59Z

I'm not completely sure if I am following. In case of behavioral cloning, you two suggest something like the following?

"""
Example code for behavioral cloning
"""
from stable_baselines3 import PPO
import gym

# Initialize environment and agent
env = gym.make("MountainCarContinuous-v0")
ppo = PPO("MlpPolicy", env)

# Extract initial policy
policy = ppo.policy

# Perform behavioral cloning with external code
pretrained_policy = external_supervised_learning(policy, external_dataset)

# Insert pretrained policy back into agent
ppo.policy = pretrained_policy

# Perform training
ppo.learn(total_timesteps=int(1e6))

araffin · 2020-05-22T10:16:18Z

I'm not completely sure if I am following. In case of behavioral cloning, you two suggest something like the following?

yes. In practice, because ppo.policy is an object, it could be modified by reference, so policy = ppo.policy and ppo.policy = pretrained_policy could be removed (even though it is cleaner written like you did).

skervim · 2020-05-22T11:27:49Z

FYI, my use case is that I have a custom environment and would like to pretrain an SB3 ppo agent with an expert dataset that I have created for that environment in a simple behavioral cloning fashion. Then I would like to continue training the pretrained agent.

I would gladly provide an example, as suggested by @araffin, but I'm not completely sure how it should look like.

Is @AdamGleave's https://github.com/HumanCompatibleAI/imitation going to support SB3 soon? In that case, should the part:

# Perform behavioral cloning with external code
pretrained_policy = external_supervised_learning(policy, external_dataset)

be implemented there and then an example should be created in the SB3 documentation?

Which parts are needed for such an implementation?

Code to create an expert data set by simulating an environment (with some agent/policy) and storing observations and actions (like: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/gail/dataset/record_expert.py)
Code to represent an expert data set, and to provide batches, shuffling etc. (like: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/gail/dataset/dataset.py). Should this be written from scratch, or reused? Or should this be used: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/dataset.py ?
PyTorch code to perform supervised learning.

Am I missing anything? I would like to contribute back to the repository and try to work on this task, however I think I would need some hint on how to start and could benefit from some guidance of those who have already worked on this problem.

araffin · 2020-05-22T11:57:41Z

be implemented there and then an example should be created in the SB3 documentation?

@AdamGleave is busy with NeurIPS deadline... so better to just create a stand-alone example as a colab notebook here (SB3 branch).

Code to create an expert data set by simulating an environment (with some agent/policy) and storing observations and actions

Usually people have their own format, but yes the dataset creation code from SB2 can be reused (it is not depending on TF at all).

Code to represent an expert data set, and to provide batches, shuffling etc.

Yes, but this will be contained in the training loop normally. (the SB2 code can be simplified as we don't support GAIL)
I'm not sure we need a class for that in a stand-alone code.

PyTorch code to perform supervised learning.

your 2nd and 3rd point can be merged into one I think.

araffin · 2020-05-22T11:59:02Z

Last thing, it is not documented yet, but policies can be saved and loaded without a model now ;).

EDIT: model = PPO("MlpPolicy", "MountainCarContinuous-v0") works too

skervim · 2020-05-22T12:28:40Z

Alright, thanks for the clarifications.
I will try to implement a simple standalone example, and PR it as a colab notebook to the SB3 branch when I have it working!

araffin · 2020-05-26T09:15:14Z

@skervim I updated the notebook and added support for discrete actions + SAC/TD3

You can try the notebook online here

We just need to update the documentation and we can close this issue.

skervim · 2020-05-28T05:32:19Z

@araffin: Glad that I could contribute, and happy to have learned something new from your improvements to the notebook :)

flint-xf-fan · 2020-08-15T06:40:09Z

I want to ask something related to this. Instead of generating "expert data" after the teacher has been trained, how do I directly save the trajectory of the teacher during training as the "expert data", and use that data to train my student?

flint-xf-fan · 2020-08-15T09:27:41Z

@skervim I updated the notebook and added support for discrete actions + SAC/TD3

You can try the notebook online here

We just need to update the documentation and we can close this issue.

I downloaded the notebook and run on RTX2070 GPU with CUDA10.1 on Ubuntu 18.04. The whole notebook is working fine except for the last cell of evaluating the policy giving the error the following error. Any hints?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
 in 
----> 1 mean_reward, std_reward = evaluate_policy(a2c_student, env, n_eval_episodes=10)
      2 
      3 print(f"Mean reward = {mean_reward} +/- {std_reward}")

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/evaluation.py in evaluate_policy(model, env, n_eval_episodes, deterministic, render, callback, reward_threshold, return_episode_rewards)
     37         episode_length = 0
     38         while not done:
---> 39             action, state = model.predict(obs, state=state, deterministic=deterministic)
     40             obs, reward, done, _info = env.step(action)
     41             episode_reward += reward

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/base_class.py in predict(self, observation, state, mask, deterministic)
    287             (used in recurrent policies)
    288         """
--> 289         return self.policy.predict(observation, state, mask, deterministic)
    290 
    291     @classmethod

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/policies.py in predict(self, observation, state, mask, deterministic)
    155         observation = observation.reshape((-1,) + self.observation_space.shape)
    156 
--> 157         observation = th.as_tensor(observation).to(self.device)
    158         with th.no_grad():
    159             actions = self._predict(observation, deterministic=deterministic)

RuntimeError: CUDA error: an illegal memory access was encountered

Miffyli · 2020-08-15T10:31:42Z

I want to ask something related to this. Instead of generating "expert data" after the teacher has been trained, how do I directly save the trajectory of the teacher during training as the "expert data", and use that data to train my student?

Easiest way to do this would be to save states and actions in the environment, e.g. some kind of a wrapper that keeps track of states and actions and saves them into a file once done is encountered.

I downloaded the notebook and run on RTX2070 GPU with CUDA10.2 on Ubuntu 10.2. The whole notebook is working fine except for the last cell of evaluating the policy giving the error the following error. Any hints?

I have no idea what could cause that, sorry :/

flint-xf-fan · 2020-08-15T13:50:18Z

Easiest way to do this would be to save states and actions in the environment, e.g. some kind of a wrapper that keeps track of states and actions and saves them into a file once done is encountered.

thanks

I have no idea what could cause that, sorry :/

Ah, np. It seems to be from pytorch's side.

araffin added the enhancement New feature or request label May 22, 2020

araffin added documentation Improvements or additions to documentation help wanted Help from contributors is welcomed labels May 22, 2020

skervim mentioned this issue May 25, 2020

Add exemplary notebook for behavior cloning Stable-Baselines-Team/rl-colab-notebooks#3

Merged

araffin removed enhancement New feature or request help wanted Help from contributors is welcomed labels May 26, 2020

araffin mentioned this issue May 27, 2020

Roadmap to Stable-Baselines3 V1.0 #1

Closed

42 tasks

araffin closed this as completed Aug 3, 2020

araffin mentioned this issue Aug 14, 2020

Pre_Train in Stablebaselines3 #139

Closed

araffin mentioned this issue Dec 2, 2020

model.pretrain() using DDPG+HER #249

Closed

hectorIzquierdo mentioned this issue Aug 17, 2021

[Question] Pretraining policy using BC anc continue training using SB3 #543

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for pretraining [feature request] #27

Add support for pretraining [feature request] #27

skervim commented May 22, 2020

araffin commented May 22, 2020

araffin commented May 22, 2020

Miffyli commented May 22, 2020

araffin commented May 22, 2020

Miffyli commented May 22, 2020

skervim commented May 22, 2020

araffin commented May 22, 2020

skervim commented May 22, 2020

araffin commented May 22, 2020 •

edited

Loading

araffin commented May 22, 2020 •

edited

Loading

skervim commented May 22, 2020

araffin commented May 26, 2020

skervim commented May 28, 2020

flint-xf-fan commented Aug 15, 2020

flint-xf-fan commented Aug 15, 2020 •

edited

Loading

Miffyli commented Aug 15, 2020

flint-xf-fan commented Aug 15, 2020

Add support for pretraining [feature request] #27

Add support for pretraining [feature request] #27

Comments

skervim commented May 22, 2020

araffin commented May 22, 2020

araffin commented May 22, 2020

Miffyli commented May 22, 2020

araffin commented May 22, 2020

Miffyli commented May 22, 2020

skervim commented May 22, 2020

araffin commented May 22, 2020

skervim commented May 22, 2020

araffin commented May 22, 2020 • edited Loading

araffin commented May 22, 2020 • edited Loading

skervim commented May 22, 2020

araffin commented May 26, 2020

skervim commented May 28, 2020

flint-xf-fan commented Aug 15, 2020

flint-xf-fan commented Aug 15, 2020 • edited Loading

Miffyli commented Aug 15, 2020

flint-xf-fan commented Aug 15, 2020

araffin commented May 22, 2020 •

edited

Loading

araffin commented May 22, 2020 •

edited

Loading

flint-xf-fan commented Aug 15, 2020 •

edited

Loading