
Introduce performance evaluation #288

Merged
merged 20 commits into from
Dec 27, 2023

Conversation

BartekCupial
Collaborator

@BartekCupial BartekCupial commented Dec 14, 2023

Overview

This PR introduces a new script, eval.py, designed for faster evaluation using multiple environments and leveraging the efficiency of the Sample Factory sampler. Additionally, an example usage of eval.py is demonstrated in the added eval_mujoco.py script within sf_examples.

Key Changes

  • New Script: eval.py

    • A fast evaluation script similar to enjoy.py.
    • Utilizes multiple environments for improved speed.
  • Example Usage: eval_mujoco.py

    • Demonstrates how to use the new eval.py script.
  • Core Logic in evaluation_sampling_api.py

    • Heavily inspired by simplified_sampling_api.py.
    • The main differences are the use of msg_handlers, episode message processing, and loading pretrained checkpoints.
  • Additional features

    • Because during evaluation we might want to use different arguments than during training, I've added checkpoint_override_defaults. It serves the same purpose as load_from_checkpoint(cfg), but keeps the option of overriding cfg with argv.
    • Added EpisodeCounterWrapper, which counts episodes for each environment. It is used during eval so that shorter episodes don't bias the results.
    • Saves the eval results to a CSV file.
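
The episode-counting idea above can be sketched as a minimal wrapper. `EpisodeCounterWrapper` here is a hypothetical reimplementation based on the description, not the PR's actual code, and `DummyEnv` is a toy stand-in for a real environment:

```python
class EpisodeCounterWrapper:
    """Hypothetical sketch: counts completed episodes per environment so
    evaluation can stop after a fixed number of episodes per env, rather
    than a fixed number of steps (which would over-weight short episodes)."""

    def __init__(self, env):
        self.env = env
        self.episode_count = 0

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:
            self.episode_count += 1
        return obs, reward, done, info


class DummyEnv:
    # toy environment that ends every episode after 3 steps
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return 0, 0.0, self.t >= 3, {}


env = EpisodeCounterWrapper(DummyEnv())
for _ in range(2):  # run 2 full episodes
    env.reset()
    done = False
    while not done:
        _, _, done, _ = env.step(None)
print(env.episode_count)  # → 2
```

With a counter like this, the eval loop can collect exactly N episodes from every environment before aggregating results.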

@BartekCupial
Collaborator Author

I'm getting this message when the script finishes, not sure yet why.

[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

    self.sampling_loop: SamplingLoop = SamplingLoop(self.cfg, self.env_info)
    # don't pass self.param_servers here, learners are normally initialized later
    # TODO: fix above issue
    self.sampling_loop.init(self.buffer_mgr)
Collaborator Author


Originally I intended to also pass self.param_servers to self.sampling_loop.init(self.buffer_mgr). Unfortunately, I got some errors related to pickling the ActorCritic. I think my approach to loading the models from checkpoints could be improved and simplified, but I don't know how I could achieve that. I'd be grateful for any suggestions.

The error looks like this

  File "/home/bartek/anaconda3/envs/sf_nethack/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
RuntimeError: Tried to serialize object __torch__.sample_factory.algo.utils.running_mean_std.RunningMeanStdInPlace which does not have a __getstate__ method defined!
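
The RuntimeError above comes from trying to pickle an object that doesn't support the pickle protocol (here, a TorchScript-related module). The general mechanism behind the fix is defining `__getstate__`/`__setstate__`; this is a self-contained illustration with a stand-in class, not the actual RunningMeanStdInPlace:

```python
import pickle


class RunningStats:
    # illustrative stand-in for a stats module like RunningMeanStdInPlace
    __slots__ = ("mean", "var", "count")

    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 0

    # defining explicit pickle hooks makes the object serializable even
    # when the default machinery can't handle it (the error above
    # complains about a missing __getstate__)
    def __getstate__(self):
        return {s: getattr(self, s) for s in self.__slots__}

    def __setstate__(self, state):
        for name, value in state.items():
            setattr(self, name, value)


restored = pickle.loads(pickle.dumps(RunningStats()))
print(restored.mean, restored.count)  # → 0.0 0
```

For TorchScript-compiled modules specifically, adding these hooks to the wrapper may not be enough; avoiding sending the jitted module across processes at all (see the reviewer's suggestion below) is often the simpler route.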

Collaborator


I'd have to look into this.
Perhaps it's an issue with a PyTorch jitted module. A solution might be to pass just the parameters, not the actor-critic as a whole. I'm not sure what the best approach is.
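
The suggestion above (ship the parameters, not the model object) can be sketched without PyTorch. `TinyModel` and its methods are illustrative stand-ins mimicking the `state_dict`/`load_state_dict` pattern, not Sample Factory APIs:

```python
import pickle


class TinyModel:
    # stand-in for an ActorCritic; a real one may contain jitted
    # submodules that cannot be pickled directly
    def __init__(self):
        self.weights = {"w": [0.1, 0.2], "b": [0.0]}

    def state_dict(self):
        return dict(self.weights)

    def load_state_dict(self, sd):
        self.weights = dict(sd)


# instead of pickling the model object itself (which fails for jitted
# modules), serialize only its parameters as plain data...
payload = pickle.dumps(TinyModel().state_dict())

# ...and rebuild an identical model on the receiving process
fresh = TinyModel()
fresh.load_state_dict(pickle.loads(payload))
print(fresh.weights["w"])  # → [0.1, 0.2]
```

The receiving side constructs the model from the config it already has, so only plain tensors ever cross the process boundary.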

@BartekCupial
Collaborator Author

@alex-petrenko can I ask for the review, please?

@codecov-commenter

codecov-commenter commented Dec 27, 2023

Codecov Report

Attention: 318 lines in your changes are missing coverage. Please review.

Comparison is base (6379cf9) 79.52% compared to head (a5fde39) 76.32%.

Files Patch % Lines
...e_factory/algo/sampling/evaluation_sampling_api.py 0.00% 194 Missing ⚠️
sample_factory/eval.py 0.00% 84 Missing ⚠️
sf_examples/mujoco/fast_eval_mujoco.py 0.00% 16 Missing ⚠️
sample_factory/cfg/arguments.py 7.14% 13 Missing ⚠️
sample_factory/envs/env_wrappers.py 28.57% 10 Missing ⚠️
sample_factory/envs/create_env.py 66.66% 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #288      +/-   ##
==========================================
- Coverage   79.52%   76.32%   -3.20%     
==========================================
  Files          97      100       +3     
  Lines        7517     7845     +328     
==========================================
+ Hits         5978     5988      +10     
- Misses       1539     1857     +318     


@klyuchnikova-ana klyuchnikova-ana merged commit 314c9fe into alex-petrenko:master Dec 27, 2023
8 of 10 checks passed
@klyuchnikova-ana
Collaborator

> I'm getting this message when the script finishes, not sure yet why.
>
> [W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Yes, this started to happen after some PyTorch update. I've been meaning to investigate, but I also found that it's totally harmless and in practice can be ignored.
