Collector sampling with multiple environments does not seem to be unbiased with n_episodes #1042
Thank you for the detailed description; I'm going to think about this tomorrow and give a proper answer. But already the situation you describe seems unusual - during training you would usually parallelize different instantiations of the same environment type. Why would some of them be slower than the others? Do you have a specific use case in mind?
Actually, I thought I'd give a deterministic example to make the issue clearer, but I now realize that was not necessary and might have caused confusion. In my specific situation, episode lengths can vary a lot between episodes. The environments are not inherently different, but if some envs get stuck with long episodes, this problem can arise. I should have started with this explanation, my apologies. I hope it is clearer now.
Hi @utkarshp, thanks for the issue. We recently had a discussion around this very topic, and currently I believe that if all of your envs are created equal (i.e., the length of an episode is just due to some randomness), then you are simply not collecting enough episodes to capture the variance between short and long episodes of your task. Deliberately stopping the first envs and "handpicking" some of the longer runs might actually skew the statistics in the same way as collecting too few episodes and thus only ending up with short ones. The solution would be to either reduce the number of envs so that more episodes are collected per env, or increase the number of collected episodes. Please let me know if you have other reasons why you would like each env to sample exactly two episodes. We will also perform some testing of whether there are significant differences when episodes come only from a small subset of the available envs versus equally from all envs.
The only reason is time. When I am training, I am collecting lots of episodes from each env, so this is not such an issue. When testing, I want to be able to test on the same test set each time I test. Furthermore, I have a baseline algorithm that also needs to run on the same test set so that I can compare it with the RL agent's performance. I have 100 pre-selected episodes (= seeds) that I want to run all tests on, and my foremost requirement is that each of these episodes be evaluated once, so that a fair comparison can be made with the baseline algorithm and with past iterations. Of course I plan on trying different test sets for robustness, but currently I am only looking for reproducible behavior on a fixed test set. Since there is no good way (that I know of) to remove envs that are done with their episodes from the rotation, I elected to divide the episodes equally among the different envs, hoping that would work. Is there a better way to achieve this? Anyway, each episode takes around 10-15 minutes. If I parallelize into 50 threads, I am done in about 30 minutes for each test; otherwise, it takes longer. I have tried with 20-25 envs (with 4-5 episodes each), and this is still an issue. I suppose I could also try 10 envs with 10 episodes each and reduce my testing frequency. All in all, I still think that enforcing an equal number of episodes per env would be desirable behavior, but now I think perhaps this could be a feature request.
A bit unrelated perhaps, but I guess my problem could also be solved through some sort of interprocess communication between my test envs. Currently I pass episode start seeds at test env init so that each env can execute the episodes it is assigned. Once an env is done with its two episodes, the next reset and step produce garbage that I don't want. If the processes could communicate, the faster envs could take over the load from the slower envs, and things could still be made to work. While this seems more complicated and error-prone, it may be worth exploring in case there is a sufficient speedup.
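A minimal sketch of the kind of per-env seed assignment described above, assuming gymnasium-style environments; the wrapper name `AssignedSeedsEnv` and its structure are hypothetical illustrations, not the actual setup used in this issue:

```python
import gymnasium as gym


class AssignedSeedsEnv(gym.Wrapper):
    """Hypothetical wrapper: each worker env gets its own list of test seeds
    and reseeds itself with the next one on every reset. Once the list is
    exhausted, further episodes are the unwanted ones described above."""

    def __init__(self, env: gym.Env, assigned_seeds: list[int]):
        super().__init__(env)
        self._assigned_seeds = list(assigned_seeds)
        self._next = 0

    def reset(self, **kwargs):
        if self._next < len(self._assigned_seeds):
            kwargs["seed"] = self._assigned_seeds[self._next]
        self._next += 1
        return self.env.reset(**kwargs)
```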
Hi @utkarshp, using the sequential execution where each env has to terminate in each batch is, as you rightly point out, unbiased. It might be computationally quite undesirable though. Consider for example an env that takes t steps in 90% of cases and T steps otherwise. For 10 test envs with 10 runs each, the batch-sequential run time can approach 10T, whereas there is a good chance of T + 9t in the asynchronous case. As your use case is reproducibility at test time rather than performance, the batch-sequential fashion looks alright though. On a related note, do you find substantially different behaviour for running each of the 100 test env seeds one time vs. performing 100 runs in one test env (passing only the seed for the first reset, which fixes the sequence of random numbers / starting states for the subsequent episodes as well)?
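To make the t-vs-T comparison above concrete, here is a quick Monte Carlo sketch with made-up numbers (t = 10, T = 100, 10% long episodes); it only illustrates the argument and is not code from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)
t, T, p_long = 10, 100, 0.1          # made-up episode lengths and tail probability
n_envs, runs_per_env, trials = 10, 10, 10_000

lengths = rng.choice([t, T], size=(trials, runs_per_env, n_envs), p=[1 - p_long, p_long])

# Lock-step ("batch sequential"): every round lasts as long as its slowest env.
lockstep = lengths.max(axis=2).sum(axis=1)

# Asynchronous: each env runs its own episodes back to back; wall time is set
# by the slowest env's total.
asynchronous = lengths.sum(axis=1).max(axis=1)

print(f"lock-step mean steps: {lockstep.mean():.0f}, async mean steps: {asynchronous.mean():.0f}")
```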
Yes, I agree with the computational toll this would take. Perhaps the collector could be given an additional non-default argument (unbiased=False) that can be set to True to ensure this behavior? I am not entirely sure I fully understand your question, but I will try to answer. With the current implementation, yes, I do find substantially different behavior in the following two cases:
In the second case, I find that some of the envs are done with their allotted episodes and start new ones. So I get some unwanted episodes, but miss out on some of those that I wanted. Please let me know if this answers your question.
Agree, it will be a non-default argument. Yes, I see how you would get different results in the two cases you describe. What I meant is slightly different; let me try to be clearer. By passing start states, do you mean passing seeds? Could you please elaborate a bit on the environment you are using (deterministic or stochastic, does the seed impact only the initial state distribution or something else as well, how different are the starting states from the initial state distribution)? The two cases I meant are the ones from my previous comment: running each of the 100 test seeds once across the test envs, vs. performing 100 runs in a single test env that is seeded only at the first reset. For a sufficiently large number of runs, both approaches should converge to the same evaluation statistics.
My env is deterministic, so taking the same action in the same state always leads to the same reward and state transition. So passing 100 start states would be the same as passing 100 seeds and then reseeding with the next seed at reset time. I agree that the two situations you described should converge to essentially the same evaluation statistics. The problem in my case is that, for any policy running on a set of episodes, the rewards tend to have a very large standard deviation. Comparing different policies on different sets of episodes is then a challenge, since their confidence intervals tend to overlap. However, if I can show that for the same set of episodes one policy is better than the other in terms of mean reward, and that this holds for multiple sets of episodes, then I have a concrete result about which policy is better. Does this make sense? Please let me know if you think there is a flaw in my reasoning.
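The variance-reduction argument here is essentially paired evaluation on a common set of episodes; a tiny illustration with entirely synthetic numbers (all values made up, not from this issue):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
episode_difficulty = rng.normal(0.0, 10.0, size=n)            # large per-episode spread
policy_a = episode_difficulty + rng.normal(1.0, 1.0, size=n)  # A is ~1 reward better
policy_b = episode_difficulty + rng.normal(0.0, 1.0, size=n)

# Unpaired view: the episode-to-episode spread dominates both standard deviations,
# so the confidence intervals of the two means overlap.
print(policy_a.mean() - policy_b.mean(), policy_a.std(), policy_b.std())

# Paired view on the same episodes: the shared difficulty cancels out,
# so the mean difference has a much tighter standard error.
diff = policy_a - policy_b
print(diff.mean(), diff.std() / np.sqrt(n))
```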
We'll have to do #1058 first. The Collector is too convoluted to be touched, so we'll improve the situation in multiple steps.
I have observed that when the Collector is initialized with n_episodes, it tends to sample more from environments that take fewer steps to finish than from the others. There is some logic implemented in the collect method that tries to address this, but it does not seem to be doing enough. Here is the code that tries to deal with this by removing some envs:
data/collector.py, lines 367–373 on the master branch:
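The embedded snippet is not reproduced in this thread; based on the variable names used in the walkthrough below (episode_count, surplus_env_num), the referenced logic is roughly of the following shape. This is a self-contained paraphrase, not the verbatim master-branch code, and the function and argument names are placeholders:

```python
import numpy as np


def drop_surplus_envs(ready_env_ids: np.ndarray,
                      finished_local_ids: np.ndarray,
                      n_episode: int,
                      episode_count: int) -> np.ndarray:
    """Once fewer episodes are still needed than there are ready envs,
    stop stepping the surplus envs, chosen among those that just finished."""
    surplus_env_num = len(ready_env_ids) - (n_episode - episode_count)
    if surplus_env_num <= 0:
        return ready_env_ids            # nothing changes while there is no surplus
    mask = np.ones(len(ready_env_ids), dtype=bool)
    mask[finished_local_ids[:surplus_env_num]] = False
    return ready_env_ids[mask]
```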
Here is the problem: suppose we have 50 envs, running 100 episodes. Ideally, I would want each env to be stepped through two episodes regardless of their lengths. Now suppose 30 of these 50 envs are slow and take 31 steps per episode, while the remaining 20 are fast and take 10 steps per episode. After 10 steps, we get episode_count = 20 => surplus_env_num < 0, and nothing changes. All envs keep being stepped, the fast envs each finish another episode, and we get episode_count = 40 => surplus_env_num < 0. The same thing happens again, and we finally reach episode_count = 60 => surplus_env_num > 0. At this point, we have already gathered a total of 60 episodes from the fast envs (3 episodes each), but no episode from the slow envs has finished. In the end, we would get 3 episodes each from the fast envs and 1-2 episodes each from the slower envs.
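Plugging the walkthrough's numbers into that rule confirms that nothing is removed until the fast envs have already contributed 60 episodes:

```python
# Numbers from the walkthrough above: 50 envs, 100 requested episodes,
# fast envs finishing an episode every 10 steps.
n_envs, n_episode = 50, 100
for episode_count in (20, 40, 60):
    surplus_env_num = n_envs - (n_episode - episode_count)
    print(episode_count, surplus_env_num)   # 20 -> -30, 40 -> -10, 60 -> 10
```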
I made an attempt at fixing this as follows. Essentially, this attempt makes the envs that have finished wait on those that have not, and then calls reset on all of them at once.
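The attempt itself is not reproduced here; below is a minimal standalone sketch of the described idea (finished envs idle until every env has finished, then all are reset together), written against plain gymnasium envs and a generic `policy` callable rather than the tianshou Collector API, so names and structure are illustrative only:

```python
import numpy as np
import gymnasium as gym


def collect_equal_episodes(envs: list[gym.Env], policy, episodes_per_env: int):
    """Step all envs in lock-step; envs that finish an episode wait for the
    others, then every env is reset at once. Each env contributes exactly
    `episodes_per_env` episodes, at the cost of waiting for the slowest env
    in every round."""
    n = len(envs)
    returns = [[] for _ in range(n)]
    for _ in range(episodes_per_env):
        obs = [env.reset()[0] for env in envs]         # reset all envs together
        done = np.zeros(n, dtype=bool)
        ep_ret = np.zeros(n)
        while not done.all():
            for i, env in enumerate(envs):
                if done[i]:
                    continue                            # finished envs wait for the rest
                action = policy(obs[i])
                obs[i], rew, terminated, truncated, _ = env.step(action)
                ep_ret[i] += rew
                done[i] = terminated or truncated
        for i in range(n):
            returns[i].append(ep_ret[i])
    return returns
```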