[MRG] StableBaselines Agent #148

Merged: 15 commits merged into rlberry-py:main on Apr 22, 2022

Conversation

@mmcenta (Collaborator) commented Mar 3, 2022

I'm finishing up the StableBaselines wrapper I was working on before. Unfortunately, SB3 algorithms don't seem to reliably satisfy check_fit_additive, so I removed that check from the tests.

Let me know what you think!
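For context, here is a minimal usage sketch of the wrapper as described in this PR; the import path and constructor arguments (algo_cls, policy) are assumptions and may differ from the final code.

```python
from stable_baselines3 import A2C

from rlberry.envs import gym_make
from rlberry.agents.stable_baselines import StableBaselinesAgent  # import path assumed

# Wrap an SB3 algorithm so it behaves like any other rlberry agent.
env = gym_make(id="CartPole-v0")
agent = StableBaselinesAgent(env, algo_cls=A2C, policy="MlpPolicy", verbose=0)

agent.fit(budget=1_000)             # budget = number of SB3 timesteps
action = agent.policy(env.reset())  # act with the trained policy
```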

@mmcenta mmcenta self-assigned this Mar 3, 2022
@mmcenta mmcenta added the enhancement New feature or request label Mar 3, 2022
@mmcenta (Collaborator, Author) commented Mar 3, 2022

It seems the tests are failing because stable_baselines isn't a requirement. Should we set it up as an optional requirement, as we do with torch?

@TimotheeMathieu (Collaborator) commented:

Yes, I think it is OK to add it as an optional requirement, because people who install torch will probably also want to install Stable Baselines.
My only concern is the install time when running the tests, but I guess it should not be a huge increase, because a lot of Stable Baselines' dependencies are already dependencies of rlberry.
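For reference, optional dependencies are typically declared as packaging extras. A hypothetical sketch of what this could look like in setup.py (the extra's name and version pins are made up, not rlberry's actual configuration):

```python
from setuptools import setup

setup(
    name="rlberry",
    # ... other metadata ...
    extras_require={
        # installed with: pip install rlberry[torch_agents]
        "torch_agents": [
            "torch>=1.6",
            "stable-baselines3",
        ],
    },
)
```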

@mmcenta (Collaborator, Author) commented Mar 11, 2022

@omardrwch mentioned that maybe we can create a dummy SB3 algorithm class and avoid the import issues. I will look into this.

Edit: Actually, it was a recommendation to use only one SB3 algorithm in the tests - done!

@mmcenta changed the title from [WIP] StableBaselines Agent to [MRG] StableBaselines Agent on Mar 11, 2022
Review thread on rlberry/utils/tests/test_sb_agent.py (outdated, resolved)
@TimotheeMathieu TimotheeMathieu added this to To Do in V 0.3.0 Mar 15, 2022
@TimotheeMathieu (Collaborator) left a review comment

Thanks, this looks nice!
I have just a small comment on the test and some comments on the doc.
Concerning the doc, could you move the link to the Stable Baselines tutorial (and also the link to the gym tutorial) from the index page to the user guide? I think it would make more sense there. Also, could you add the wrapper to the API reference (file docs/api.rst)?

Comment on lines 20 to 21
Does not check that the fit is additive, because that doesn't seem to be
always the case with StableBaselines3.
@TimotheeMathieu (Collaborator) commented:

This seems to be due to the callback that can stop the fit early (e.g. the _on_step method, see https://stable-baselines.readthedocs.io/en/master/guide/callbacks.html). Would it make sense to redefine the default Stable Baselines callback in order to avoid stopping early? And is this early stopping a problem for hyperparameter optimization? (also @omardrwch on this)

@mmcenta (Collaborator, Author) commented Mar 15, 2022

Oh, I just added the agent to the API reference and I've rewritten its documentation page so it reads more like a tutorial. I will move the links as you asked soon.

About the additive fit, the comment is a bit of a lie: I'm pretty sure that SB3 should support additive fitting, it just doesn't behave that way when I test it. I actually spent some time on the following experiment:

  1. Create two identical agents (same seed too);
  2. Train one for X steps;
  3. Compare their policy logits on a fixed trajectory (policy logits involve no sampling, so we eliminate a source of randomness here); a rough sketch of this comparison is shown below.
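Roughly, the logit comparison looks like the sketch below; policy_logits is a hypothetical helper, and it assumes a discrete action space and an SB3 version where policy.get_distribution() is available.

```python
import numpy as np
import torch

from stable_baselines3.common.utils import obs_as_tensor


def policy_logits(model, observations):
    """Return the policy logits of an SB3 actor-critic model on fixed observations."""
    obs_tensor = obs_as_tensor(np.asarray(observations), model.device)
    with torch.no_grad():
        dist = model.policy.get_distribution(obs_tensor)
    return dist.distribution.logits.cpu().numpy()


# Compare two agents on the same fixed trajectory of states:
# max_abs_diff = np.max(np.abs(policy_logits(agent_a, traj) - policy_logits(agent_b, traj)))
```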

And what I got was a bit difficult to explain, but I will summarize here:

  • For our agents, the agents have the same logits when X is small (<10) but for bigger values, it is clear that they are different.
  • For SB3 agents... There doesn't seem to be a lot of consistency: sometimes they have identical results even though X = 100 and sometimes they have different results when they should be equal. The additive fit constraint is only respected sometimes. I also found out that adding more training steps can change results, for example, agents trained for 100 and 200 steps give different results but agents trained for 500 and 600 steps give the same results...

In the end, I shelved the whole thing and ignored the test! I think that there are some interesting questions in checking if RL agents are almost equal, namely:

  • How do we set an appropriate budget difference such that the algorithms can be differentiated? Budget usually refers to the amount of collected experience, but the agent may spend some time collecting experience before updating its weights;
  • How does the environment affect the testing? I had some problems with small environments such as Chain-v0 where the agent learns to always go right very quickly. As a result, differentiating between two trained agents is difficult with policy samples since the policies are so similar. Additionally, how should we design an environment for such tests?
  • On which states should we compare the policies? Currently, we use only initial states, but I also had problems with environments that have simple optimal policies on the initial states. I ended up using states obtained by running one of the policies once.

This investigation was part of my look into the deep RL agents, and I noticed some things we could improve in the seeding (namely, setting the external library seeds in the __init__ method of torch agents to make sure that the random weight initialization is the same; a rough sketch is below). However, even though I spent quite some time aggressively seeding everything, I still cannot get identical training runs (same episode lengths and returns), which may indicate that there are still things to correct.
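To illustrate what I mean by seeding in __init__, here is a bare-bones sketch using plain torch/numpy seeding; this is only the idea, not the actual rlberry code (rlberry's own seeding utilities would be used in practice).

```python
import random

import numpy as np
import torch


class MyTorchAgent:
    """Toy agent illustrating external-library seeding at construction time."""

    def __init__(self, env, seed=42):
        # Seed all external sources of randomness *before* building the networks,
        # so that the random weight initialization is reproducible.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

        self.env = env
        self.policy_net = torch.nn.Linear(
            env.observation_space.shape[0], env.action_space.n
        )
```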

@TimotheeMathieu (Collaborator) commented:

Some comments on this

  • When you run your tests, do you use a single thread, without parallelization? Parallelization may have some unexpected consequences, which is why I used non-parallelized agents in the tests.
  • For the Stable Baselines algorithms, I think this is linked to the early stopping they perform, which seems a bit ad hoc to me. I don't know how well understood the tradeoff behind this kind of early stopping is, which is why I think we may want to avoid it if we are after stable/reliable/reproducible results.
  • How does the environment affect the testing?
    I think you are right. One way would be to test on an environment with a lot of possible actions, so that at first the agent basically has to do pure exploration. Maybe we could design an ad-hoc environment with the FiniteMDP class to do that (a rough sketch is at the end of this comment). Another way would be to test the agent on a bandit environment (bandit environments can have many possible actions).
  • Generally, you are right that the budget is not very precise. That is by design in rlberry, to keep things flexible, but it makes testing a bit awkward. One way to solve this would be to specify a set of fit arguments (budget, n_simulations...) for each algorithm in the test; it would be a bit tedious but still doable.

About fixing the external seed in __init__, I think there may be some caveats linked to parallelization and the spawning of new seeds in AgentManager and multimanager. It would be great if all the seeding in rlberry could be handled by rlberry itself, and not manually via set_external_seed, but I am not sure how easy that is.
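A rough sketch of such an ad-hoc environment; the import path and the FiniteMDP constructor (reward matrix R of shape (S, A), transition tensor P of shape (S, A, S)) are assumptions to check against the actual class.

```python
import numpy as np

from rlberry.envs.finite import FiniteMDP  # import path assumed

S, A = 5, 50  # few states, many actions -> early training is essentially pure exploration

rng = np.random.default_rng(0)
R = rng.uniform(size=(S, A))            # rewards: no action is obviously best at first
P = rng.uniform(size=(S, A, S))
P /= P.sum(axis=-1, keepdims=True)      # normalize to valid transition probabilities

env = FiniteMDP(R, P)
```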

@mmcenta (Collaborator, Author) commented Mar 15, 2022

Just some quick responses:

  • Unfortunately, the tests are indeed all single-threaded;
  • I didn't realize that they apply early stopping by default; this might indeed explain it;
  • I thought about doing something similar to what you suggested, namely writing some tools/environments for these "almost equal" tests;
  • I think it is best to change the budget per test. The current value works well for the basic agents, but I am 80% sure it won't work for all deep RL algorithms.
  • That's a good point about using global seeds, but unfortunately torch doesn't use Seeder objects, so we have to rely on set_external_seed for the Torch agents and the StableBaselinesAgent.

I will take a look into this early stopping issue.

@TimotheeMathieu (Collaborator) commented Mar 15, 2022

After looking into it, I think you are right: the default in Stable Baselines is no early stopping (I had to go back through several inherited modules in Stable Baselines, it is quite the maze). The problem seems to be somewhere else.

On the other hand, the learn method has a loop of the form
for update in range(1, total_timesteps // self.n_batch + 1):
which would not cover exactly total_timesteps timesteps unless self.n_batch divides total_timesteps. Maybe the problem comes from here?
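A quick numeric illustration of that loop (numbers made up):

```python
total_timesteps = 100
n_batch = 64  # e.g. n_steps * n_envs for A2C

n_updates = total_timesteps // n_batch  # -> 1 update
trained = n_updates * n_batch           # -> 64 timesteps actually trained, not 100
```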

@mmcenta (Collaborator, Author) commented Mar 15, 2022

On another note, I added a test for the tensorboard integration, but it fails when trying to import tensorboard. Should I remove it?

@TimotheeMathieu (Collaborator) commented Mar 15, 2022

I don't really know what we should do. I mean, tensorboard is not huge (5 MB), so I think we could just add it to the requirements and be done with it.
On the other hand, if we want to stay compatible with everyone, we may encounter this problem again in the future, and it would not be good to always add requirements like that. One solution would be to have a directory for additional "extra" tests; these tests could run in a parallel Azure pipeline check (a pipeline on linux that runs only these tests, with its own additional dependencies that would not appear in requirements.txt, so the main checks don't take too long).
Or we could also not test them at all. Still, I think it is good to have these tests, so we keep them; the only question is whether we check them in the Azure pipeline or not.

Review thread on docs/api.rst (outdated, resolved)
@mmcenta (Collaborator, Author) commented Mar 15, 2022

@TimotheeMathieu On another note, I think PR #144 broke my git pre-commit setup - flake8 now stops me because of the unused imports in the __init__.py files, which used to be ignored. Do you know how to fix this?

@TimotheeMathieu (Collaborator) commented:

No, this is not due to PR #144. pre-commit already checked the __init__ files before; it is just that pre-commit only checks the files you modified for the current commit, so maybe you never made a commit where you modified an __init__?
But you are right that we should also ignore the inits in pre-commit. I think adding --per-file-ignores="__init__.py:F401" on line 29 of pre-commit.yaml works:

       args: ['--select=F401,F405,D410,D411,D412', '--per-file-ignores="__init__.py:F401"']

but I did not test it.

Either way, you can skip the pre-commit checks with git commit -n if you ever run into this kind of problem later.


@mmcenta (Collaborator, Author) commented Mar 15, 2022

> I don't really know what we should do. I mean, tensorboard is not huge (5 MB), so I think we could just add it to the requirements and be done with it. On the other hand, if we want to stay compatible with everyone, we may encounter this problem again in the future, and it would not be good to always add requirements like that. One solution would be to have a directory for additional "extra" tests; these tests could run in a parallel Azure pipeline check (a pipeline on linux that runs only these tests, with its own additional dependencies that would not appear in requirements.txt). Or we could also not test them at all. Still, I think it is good to have these tests, so we keep them; the only question is whether we check them in the Azure pipeline or not.

@omardrwch What do you think about this? I'm tempted to just add it to the requirements for now.

Also, I'll look into the batch_size problem that @TimotheeMathieu talked about. Maybe we can set batch_size=1 and solve the issues I was having, but I'll have to run some tests again.

@mmcenta (Collaborator, Author) commented Mar 16, 2022

@TimotheeMathieu So I reran the reproducibility experiments with the SB3 A2C algorithm and our new agent; here are the results:

Difference Experiments

In these experiments, I choose a batch size X and a budget difference Y and analyze the results for every (X, Y) pair. Additionally, you can specify the base budget B. In summary, this experiment:

  1. Creates two SB3 A2C agents with batch size X;
  2. Trains the first one with a budget of B on the "CartPole-v0" environment;
  3. Trains the second one with a budget of B + Y on the "CartPole-v0" environment;
  4. Collects trajectories of states from the first agent (128 steps total);
  5. Compares both agents on the collected trajectory, using the error rate (that is, the number of disagreements divided by the trajectory length). A rough sketch of this comparison is shown below.
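The comparison in step 5 is essentially the following sketch (error_rate is a hypothetical helper; .predict() is the standard SB3 interface):

```python
import numpy as np


def error_rate(agent_a, agent_b, observations):
    """Fraction of observations on which the two agents' greedy actions disagree."""
    disagreements = 0
    for obs in observations:
        action_a, _ = agent_a.predict(obs, deterministic=True)
        action_b, _ = agent_b.predict(obs, deterministic=True)
        disagreements += int(not np.array_equal(action_a, action_b))
    return disagreements / len(observations)
```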

The figure below shows the results with B = 0:

[Figure: results_diff_0 (error rate as a function of batch size X and budget difference Y, with B = 0)]

Note that we would expect a significant error rate for all experiments where Y is not zero. For completeness, here are the results for B = 100 and B = 1000:

[Figures: results_diff_100 and results_diff_1000 (same experiment with B = 100 and B = 1000)]

As you can see, using a higher base budget only attenuates the errors.

Additive Fit Experiments

This experiment is similar to the previous setting. Instead of varying the budget difference, we vary the total budget T. This experiment:

  1. Creates two SB3 A2C agents with batch size X;
  2. Trains the first one with a budget of T on the "CartPole-v0" environment;
  3. Trains the second one with a budget of T // 2, then with a budget of T - (T // 2), on the "CartPole-v0" environment;
  4. Collects trajectories of states from the first agent (128 steps total);
  5. Compares both agents on the collected trajectory, using the error rate (that is, the number of disagreements divided by the trajectory length).

The results are shown below:

[Figure: results_add_0 (additive-fit error rate as a function of batch size X and total budget T)]

Note that we would expect all error rates to be zero, since both agents are trained for the same total budget. However, the additive fit always seems to make some mistakes when T > 0, for all batch sizes. It looks like something we can't rely on, but if you have any further suggestions, please let me know.

Edit: I also uploaded my code for these results here.

@TimotheeMathieu (Collaborator) commented:

I did some experiments and I think it may be because of reset_num_timesteps. With the following fit function:

    def fit(
        self,
        budget: int,
        tb_log_name: Optional[str] = None,
        reset_num_timesteps: bool = True,
        **kwargs,
    ):
        """Fit the agent.

        Note
        ----
        This method wraps the :code:`learn` method of the algorithm.
        Logging parameters are processed by rlberry in order to use the
        agent.writer.

        Parameters
        ----------
        budget: int
            Number of timesteps to train the agent for.
        tb_log_name: str
            Name of the log to use in tensorboard.
        reset_num_timesteps: bool
            Whether or not to reset the :code:`num_timesteps` attribute.
        """
        # if not self.wrapped._custom_logger:
        #     if tb_log_name is None:
        #         tb_log_name = self.wrapped.__class__.__name__
        #     sb_logger = configure_logger(
        #         self._verbose, self._tb_log, tb_log_name, reset_num_timesteps
        #     )
        #     self.set_logger(sb_logger)
        self.wrapped.learn(total_timesteps=budget, reset_num_timesteps=False, **kwargs)

I obtain the following additive-fit graph:

[Figure: results_add_0 (additive-fit error rates with the modified fit function)]

and I guess this is OK: with batch size 1, the fit is additive (which is what we want).

@mmcenta (Collaborator, Author) commented Mar 18, 2022

@TimotheeMathieu
I think you're right: I assumed that reset_num_timesteps was only used for logging purposes (as their docs suggest), but it seems that it may also reset the environment.
However, I can't just comment out that part of the code, because it is what integrates their logging with rlberry. I came up with a solution that seems to solve the issue - I'll push the changes as soon as I'm finished testing.
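One plausible shape for that fix (only a sketch of the idea, not necessarily the change that was actually pushed): configure the SB3 logger as before, but only reset the internal timestep counter on the very first fit call.

```python
from typing import Optional

from stable_baselines3.common.utils import configure_logger


def fit(self, budget: int, tb_log_name: Optional[str] = None, **kwargs):
    # Reset SB3's internal counters only on the first fit, so later fits accumulate.
    first_fit = self.wrapped.num_timesteps == 0

    if not self.wrapped._custom_logger:
        if tb_log_name is None:
            tb_log_name = self.wrapped.__class__.__name__
        sb_logger = configure_logger(
            self._verbose, self._tb_log, tb_log_name, reset_num_timesteps=first_fit
        )
        self.set_logger(sb_logger)

    self.wrapped.learn(
        total_timesteps=budget, reset_num_timesteps=first_fit, **kwargs
    )
```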

@mmcenta (Collaborator, Author) commented Mar 22, 2022

Just an update: I've been able to (partially) fix the additive fit issues for the agent.

I encountered one last problem: it seems difficult to guarantee additive fits when a model is saved and then loaded during training. First, the standard .save() and .load() methods from Stable Baselines don't support additive fits by themselves. By using .get_parameters() and .set_parameters() instead, I can get additive fits to work in isolation, but the approach breaks when I implement it in the agent - although it does look like some sort of approximation error, since the predicted values are very close. (A sketch of the parameter round-trip is at the end of this comment.)

This is relevant because our AgentManager saves and loads agents all the time, so I can't guarantee additive fits in that case. More specifically, our additive fit test uses AgentManagers, so it still fails :(

Edit: I am also pretty sure that the AgentManager cannot preserve additive fits for agents that use external libraries (such as PyTorch), because it reseeds the external library at the beginning of each fit. (I created #157 to discuss this.)
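For reference, the parameter round-trip referred to above is essentially this (get_parameters/set_parameters are existing SB3 BaseAlgorithm methods; the rest is a toy reproduction, not the agent code):

```python
from stable_baselines3 import A2C

model = A2C("MlpPolicy", "CartPole-v0", seed=0)
model.learn(total_timesteps=500)

# Snapshot only the trainable parameters instead of pickling the whole model.
params = model.get_parameters()  # dict of state_dicts (policy, optimizer, ...)

# Rebuild an identical model and restore the snapshot.
new_model = A2C("MlpPolicy", "CartPole-v0", seed=0)
new_model.set_parameters(params, exact_match=True)

# Continue training without resetting the internal timestep counter.
new_model.learn(total_timesteps=500, reset_num_timesteps=False)
```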

TimotheeMathieu added a commit to TimotheeMathieu/rlberry that referenced this pull request Mar 24, 2022
@omardrwch (Member) commented:

> @omardrwch What do you think about this? I'm tempted to just add it to the requirements for now.

Sorry, I'm a bit (very) late to this PR. Regarding the tensorboard requirement, I think it's OK to add it for now. The SB3 wrapper is a very important feature, so testing it is worth the extra requirement. Later, we could implement @TimotheeMathieu's suggestion (a directory for additional, heavier tests) when needed (e.g. also for more comprehensive testing of our agents, to make sure they match some benchmarks).

@TimotheeMathieu (Collaborator) left a review comment

OK for me. We can make separate PRs for the tests and the seeder later. Thanks.

@mmcenta mmcenta merged commit 847c54a into rlberry-py:main Apr 22, 2022
@mmcenta mmcenta deleted the sb-agent branch April 22, 2022 11:29
@TimotheeMathieu TimotheeMathieu moved this from To Do to Done in V 0.3.0 Apr 26, 2022