(Probably) Unfair Speed Comparison #249
Comments
Hello Antonin, thank you for reporting this issue. I really appreciate your excellent work on SB and SB3.
All of the above tests are run single-threaded, with an i7-8750H + GTX 1060 (driver version 440.64) + PyTorch 1.4.0 + TensorFlow 1.14.0 + CUDA 10.0 + Ubuntu 18.04 + Python 3.6.9. I did not align each of the repos to exactly the same set of hyperparameters, for a few reasons:
However, as you can see, we have not updated this result in recent months. Since there is no free lunch, supporting other user-friendly features such as async vecenv and multi-level dict states would slow down the existing functionality. I chose the latter, but tried to find a balance between various feature requests, performance concerns, and user experience (such as #189). Best,
Thanks =)
You are still not providing any code to reproduce those (a colab notebook would be best as I did in my first message ;)).
Well, you should compare apples to apples*, otherwise showing a comparison table is at best misleading, at worst dishonest. This is even more true as you advertise tianshou as "superfast". *at least in terms of the number of envs used for training, which is in my opinion the main factor in training speed.
I would agree with that for algorithms that require special care (A2C, PPO) but not for off-policy ones (SAC, TD3, DDPG).
I don't dispute that feature. The fact that tianshou supports dict and multiprocessing for all algorithms is a cool thing* and must be highlighted! But instead of showing a features table, you decided to show a speed comparison table which apparently does not compare the same things. So my main concern is the way things are presented. *in fact, this is in our roadmap too, see issues DLR-RM/stable-baselines3#179 and DLR-RM/stable-baselines3#216
Well, for some codebases at least (baselines, stable-baselines), it is easy to implement. And I know this is work (I'm currently checking SB3 performance vs reference implementations), but this is science: you must be fair if you compare things. Also, if you provide code to reproduce the results, other maintainers will come and help you do that fair comparison ;).
Well, if all other codebases were tested with one env, then you must for sure add a column with one env for tianshou.
The code is exactly under the
I agree with this point. But at that time my reasoning was: tianshou uses more computation within a shorter time, while the others use less computation over a longer time, so isn't this unfair to tianshou to some extent...? If we make sure all of these use exactly the same amount of computational resources, the result will be much worse. Let's discuss the solution here:
I really appreciate your insightful discussion!
I meant the code for benchmarking the other libraries.
I'm not sure I get your point. Could you elaborate?
Sounds good ;) and probably add a note explaining the 2nd column for tianshou (if you keep it) and specifying the number of parallel envs used.
For SB, I use your rl-baselines-zoo as recommended in the readme of stable-baselines, for example `python3 train.py --algo ddpg --env Pendulum-v0 --seed 4 --eval-episodes 10`, and I added the timer in
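A minimal standalone sketch of what such a timer around the learning call might look like (this is not the zoo's actual `train.py` code; SB2's `DDPG` is assumed, and exploration noise plus the zoo's tuned hyperparameters are omitted for brevity):

```python
import time

from stable_baselines import DDPG  # SB2 (TensorFlow 1.x)

# Sketch only: wrap the learning call with a wall-clock timer.
# Exploration noise and tuned hyperparameters are intentionally left out.
model = DDPG('MlpPolicy', 'Pendulum-v0', verbose=0)

start = time.time()
model.learn(total_timesteps=int(2e4))
print("Training wall-clock time: {:.1f}s".format(time.time() - start))
```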
Effectiveness: (longer time, lower computation) < (longer time, higher computation) / (shorter time, lower computation) < (shorter time, higher computation). I directly compared the first item and the last item at that time.
Oh, I see... well, this code does something completely different from what is mentioned in the readme table. And here, those are old defaults (2e5 timesteps, whereas you can solve this env with DDPG in 2e4 steps by setting the learning rate to 1e-3); that's why I was advocating to use the same hyperparams, at least for off-policy algorithms.
Yes and no, I would separate the evaluation time from the training time (or again, you need to use the same number of test environments when comparing*, otherwise you compare different things). *parallel test envs are currently not implemented in SB3 though... but I can write a quick and dirty version which runs at least n_eval_episodes
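A rough sketch of what such a quick-and-dirty parallel evaluation could look like in SB3 (the helper name `quick_parallel_eval` is hypothetical, this is not SB3's built-in `evaluate_policy`, and the `make_vec_env` import path may differ across versions):

```python
import numpy as np

from stable_baselines3.common.env_util import make_vec_env


def quick_parallel_eval(model, env_id="Pendulum-v0", n_envs=4, episodes_per_env=1):
    """Step a vectorized eval env until every worker has finished at least
    `episodes_per_env` episodes, then aggregate the episode returns."""
    env = make_vec_env(env_id, n_envs=n_envs)
    episode_returns = []
    current_returns = np.zeros(n_envs)
    episodes_done = np.zeros(n_envs, dtype=int)
    obs = env.reset()
    while (episodes_done < episodes_per_env).any():
        actions, _ = model.predict(obs, deterministic=True)
        obs, rewards, dones, _ = env.step(actions)  # VecEnv auto-resets on done
        current_returns += rewards
        for i, done in enumerate(dones):
            if done:
                episode_returns.append(current_returns[i])
                current_returns[i] = 0.0
                episodes_done[i] += 1
    env.close()
    return np.mean(episode_returns), np.std(episode_returns)
```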
I would agree if we were comparing different algorithms and different hyperparameters. For instance, with off-policy algorithms, you can do more gradient steps than steps taken in the environment, and there you have a tradeoff between computation and sample efficiency. The same goes when using more than one environment to train (we have a notebook about that tradeoff). But if you run two implementations of the same algorithm with the same number of training environments, you should be able to have a fair comparison ;)
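To make that gradient-steps vs environment-steps tradeoff concrete, a hedged SB3 sketch (the hyperparameter values are purely illustrative):

```python
from stable_baselines3 import SAC

# Same number of environment steps, different amounts of computation:
# more gradient steps per environment step usually improves sample efficiency
# at the cost of wall-clock time.
cheap = SAC("MlpPolicy", "Pendulum-v0", train_freq=64, gradient_steps=1)   # less compute
heavy = SAC("MlpPolicy", "Pendulum-v0", train_freq=64, gradient_steps=64)  # more compute

for model in (cheap, heavy):
    model.learn(total_timesteps=20_000)
```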
I indeed added an extra callback there and ran the test:

```diff
@@ -293,8 +310,10 @@ if __name__ == '__main__':
     if args.verbose > 0:
         print("Creating test environment")
 
-    save_vec_normalize = SaveVecNormalizeCallback(save_freq=1, save_path=params_path)
-    eval_callback = EvalCallback(create_env(1, eval_env=True), callback_on_new_best=save_vec_normalize,
+    # save_vec_normalize = SaveVecNormalizeCallback(save_freq=1, save_path=params_path)
+    callback_on_best = StopTrainingOnRewardThreshold(
+        reward_threshold=195, verbose=1)
+    eval_callback = EvalCallback(create_env(1, eval_env=True), callback_on_new_best=callback_on_best,
                                  best_model_save_path=save_path, n_eval_episodes=args.eval_episodes,
                                  log_path=save_path, eval_freq=args.eval_freq)
     callbacks.append(eval_callback)
```
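Note that for the diff above to run, the new callback also has to be imported; assuming the SB2-based zoo, the import would presumably be:

```python
from stable_baselines.common.callbacks import StopTrainingOnRewardThreshold
```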
This is done. For the rest, I'll do a comprehensive test during the winter break, because I'm currently busy with my final exams and course projects :(
Linking issues related to that comparison:
BTW, could you please fix SB2's CI pipeline so that the badge looks nicer?
As a small remark, the true number of algorithms implemented is 9 for SB3 (you have to include TQC / QR-DQN from the contrib repo: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib)
Hello,
You advertise tianshou as being fast and provide in the readme a comparison table.
However, no reference code is linked to reproduce the results.
So, I decided to create a colab notebook to have a fair comparison (same hyperparameters) between tianshou, Stable-Baselines and Stable-Baselines3.
On the two environments tested (Pendulum and Cartpole), the results are quite far from those reported for Stable-Baselines...
Even worse, when compared to SB2 on the Pendulum environment, Tianshou seems slower (and seems to give worse final performance).
Time to reach mean reward of -200 for TD3 on Pendulum-v0 (1 env):
SB2: ~30s (vs 99s in the readme)
SB3: ~70s
Tianshou: >110s (vs 44s in the readme)
You can find the notebook here
(Please check tianshou code, I'm not 100% sure that I re-used the same hyperparameters)
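For reference, a minimal SB3 sketch of this kind of time-to-threshold measurement (hyperparameters, `Pendulum-v0`, and the 2e4 step budget are illustrative, assuming a Gym/SB3 combination where Pendulum-v0 is registered; this is not the notebook's exact code):

```python
import time

import gym
from stable_baselines3 import TD3
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# Pendulum-v0 as used in this thread; newer Gym versions register Pendulum-v1 instead.
train_env = gym.make("Pendulum-v0")
eval_env = gym.make("Pendulum-v0")

# Stop training as soon as the evaluation mean reward reaches -200,
# then report the elapsed wall-clock time.
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=-200, verbose=1)
eval_callback = EvalCallback(eval_env, callback_on_new_best=stop_callback,
                             n_eval_episodes=10, eval_freq=1000)

model = TD3("MlpPolicy", train_env, learning_rate=1e-3, verbose=0)

start = time.time()
model.learn(total_timesteps=int(2e4), callback=eval_callback)
print("Time to reach -200: {:.1f}s".format(time.time() - start))
```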