
Introduce performance evaluation #288

Merged
merged 20 commits into from
Dec 27, 2023

Conversation

BartekCupial
Collaborator

@BartekCupial BartekCupial commented Dec 14, 2023

Overview

This PR introduces a new script, eval.py, designed for faster evaluation using multiple environments and leveraging the efficiency of the Sample Factory sampler. Additionally, an example usage of eval.py is demonstrated in the added eval_mujoco.py script within sf_examples.

Key Changes

  • New Script: eval.py

    • A fast evaluation script similar to enjoy.py.
    • Utilizes multiple environments for improved speed.
  • Example Usage: eval_mujoco.py

    • Demonstrates how to use the new eval.py script.
  • Core Logic in evaluation_sampling_api.py

    • Heavily inspired by simplified_sampling_api.py.
    • The main differences are the use of msg_handlers, episode message processing, and loading pretrained checkpoints.
  • Additional features

    • Because during evaluation we might want to use different arguments than during training, I've added checkpoint_override_defaults. It serves the same purpose as load_from_checkpoint(cfg), but keeps the option of overriding cfg with argv.
    • Added EpisodeCounterWrapper, which counts episodes for each environment. It is used during eval so that shorter episodes don't bias the results.
    • Saves the eval results to a CSV file.
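
The episode-counting idea above can be sketched as a minimal wrapper. `EpisodeCounterWrapper` here is a hypothetical reimplementation based on the description, not the PR's actual code, and `DummyEnv` is a toy stand-in for a real environment:

```python
class EpisodeCounterWrapper:
    """Hypothetical sketch: counts completed episodes per environment so
    evaluation can stop after a fixed number of episodes per env, rather
    than a fixed number of steps (which would over-weight short episodes)."""

    def __init__(self, env):
        self.env = env
        self.episode_count = 0

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:
            self.episode_count += 1
        return obs, reward, done, info


class DummyEnv:
    # toy environment that ends every episode after 3 steps
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return 0, 0.0, self.t >= 3, {}


env = EpisodeCounterWrapper(DummyEnv())
for _ in range(2):  # run 2 full episodes
    env.reset()
    done = False
    while not done:
        _, _, done, _ = env.step(None)
print(env.episode_count)  # → 2
```

With a counter like this, the eval loop can collect exactly N episodes from every environment before aggregating results.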

@BartekCupial
Collaborator Author

I'm getting this message when the script finishes, not sure yet why.

[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

    self.sampling_loop: SamplingLoop = SamplingLoop(self.cfg, self.env_info)
    # don't pass self.param_servers here, learners are normally initialized later
    # TODO: fix above issue
    self.sampling_loop.init(self.buffer_mgr)
Collaborator Author


Originally I intended to also pass self.param_servers to self.sampling_loop.init(self.buffer_mgr). Unfortunately, I got some errors related to pickling the ActorCritic. I think my approach to loading the models from checkpoints could be improved and simplified, but I don't know how I could achieve that. I'd be grateful for any suggestions.

The error looks like this

  File "/home/bartek/anaconda3/envs/sf_nethack/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
RuntimeError: Tried to serialize object __torch__.sample_factory.algo.utils.running_mean_std.RunningMeanStdInPlace which does not have a __getstate__ method defined!
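
The RuntimeError above comes from trying to pickle an object that doesn't support the pickle protocol (here, a TorchScript-related module). The general mechanism behind the fix is defining `__getstate__`/`__setstate__`; this is a self-contained illustration with a stand-in class, not the actual RunningMeanStdInPlace:

```python
import pickle


class RunningStats:
    # illustrative stand-in for a stats module like RunningMeanStdInPlace
    __slots__ = ("mean", "var", "count")

    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 0

    # defining explicit pickle hooks makes the object serializable even
    # when the default machinery can't handle it (the error above
    # complains about a missing __getstate__)
    def __getstate__(self):
        return {s: getattr(self, s) for s in self.__slots__}

    def __setstate__(self, state):
        for name, value in state.items():
            setattr(self, name, value)


restored = pickle.loads(pickle.dumps(RunningStats()))
print(restored.mean, restored.count)  # → 0.0 0
```

For TorchScript-compiled modules specifically, adding these hooks to the wrapper may not be enough; avoiding sending the jitted module across processes at all (see the reviewer's suggestion below) is often the simpler route.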

Collaborator


I'd have to look into this.
Perhaps it's an issue with a PyTorch jitted module. A solution might be to pass just the parameters, not the actor-critic as a whole. I'm not sure what the best approach is.
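
The suggestion above (ship the parameters, not the model object) can be sketched without PyTorch. `TinyModel` and its methods are illustrative stand-ins mimicking the `state_dict`/`load_state_dict` pattern, not Sample Factory APIs:

```python
import pickle


class TinyModel:
    # stand-in for an ActorCritic; a real one may contain jitted
    # submodules that cannot be pickled directly
    def __init__(self):
        self.weights = {"w": [0.1, 0.2], "b": [0.0]}

    def state_dict(self):
        return dict(self.weights)

    def load_state_dict(self, sd):
        self.weights = dict(sd)


# instead of pickling the model object itself (which fails for jitted
# modules), serialize only its parameters as plain data...
payload = pickle.dumps(TinyModel().state_dict())

# ...and rebuild an identical model on the receiving process
fresh = TinyModel()
fresh.load_state_dict(pickle.loads(payload))
print(fresh.weights["w"])  # → [0.1, 0.2]
```

The receiving side constructs the model from the config it already has, so only plain tensors ever cross the process boundary.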

@BartekCupial
Collaborator Author

@alex-petrenko can I ask for the review, please?

@codecov-commenter

codecov-commenter commented Dec 27, 2023

Codecov Report

Attention: 318 lines in your changes are missing coverage. Please review.

Comparison is base (6379cf9) 79.52% compared to head (a5fde39) 76.32%.

Files Patch % Lines
...e_factory/algo/sampling/evaluation_sampling_api.py 0.00% 194 Missing ⚠️
sample_factory/eval.py 0.00% 84 Missing ⚠️
sf_examples/mujoco/fast_eval_mujoco.py 0.00% 16 Missing ⚠️
sample_factory/cfg/arguments.py 7.14% 13 Missing ⚠️
sample_factory/envs/env_wrappers.py 28.57% 10 Missing ⚠️
sample_factory/envs/create_env.py 66.66% 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #288      +/-   ##
==========================================
- Coverage   79.52%   76.32%   -3.20%     
==========================================
  Files          97      100       +3     
  Lines        7517     7845     +328     
==========================================
+ Hits         5978     5988      +10     
- Misses       1539     1857     +318     


@klyuchnikova-ana klyuchnikova-ana merged commit 314c9fe into alex-petrenko:master Dec 27, 2023
8 of 10 checks passed
@klyuchnikova-ana
Collaborator

> I'm getting this message when the script finishes, not sure yet why.
>
> [W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Yes, this started to happen after some PyTorch update. I've been meaning to investigate, but I also found that it's totally harmless and in practice can be ignored.
