Possible inconsistencies with the PPO implementation #477

hexonfox · 2024-08-02T08:27:28Z

Problem Description

I tested different implementations of the PPO algorithm and found some discrepancies among the implementations. I tested each implementation on 56 Atari environments, with five trials per (implementation, environment) permutation.

Checklist

I have installed dependencies via poetry install (see CleanRL's installation guideline.
I have checked that there is no similar issue in the repo.
I have checked the documentation site and found not relevant information in GitHub issues.

Current Behavior

The table below depicts an environment-wise one-way ANOVA to determine the effect of implementation source on mean reward. Out of the 56 environments tested, the implementations significantly differed in nine environments, as seen in the table with respect to Stable Baselines3, CleanRL, and Baselines (not the 108 variant).

Expected Behavior

The implementations should not significantly differ in terms of mean reward.

Possible Solution

I believe that there are inconsistencies among the implementations which causes the observed environment-dependent discrepancies. For example, I found some inconsistencies (i.e., a bug) with Baselines' implementation where the frames per episode did not conform to 108K as per the v4 ALE specification, causing mean rewards to differ significantly in some environments. After correcting this, three out of nine environments previously flagged as statistically different were now not different, as seen in the table above with Baselines108. The inconsistency is likely to be related to the environments, so I would suggest starting with parts of the implementation which might affect a subset of environments (similar to the frames per episode).

Steps to Reproduce

I used ppo_atari.py and followed the same PPO hyperparameters (without LSTM) as discussed in the ICLR Blog by @vwxyzjn

# Environment
Max Frames Per Episode = 108000
Frameskip = 4
Max Of Last 2 Frames = True
Max Steps Per Episode = 27000
Framestack = 4

Observation Type = Grayscale
Frame Size = 84 x 84

Max No Operation Actions = 30
Repeat Action Probability = 0.0

Terminal On Life Loss = True
Fire Action on Reset = True
Reward Clip = {-1, 0, 1}
Full Action Space = False

# Algorithm
Neural Network Feature Extractor = Nature CNN
Neural Network Policy Head = Linear Layer with n_actions output features
Neural Network Value Head = Linear Layer with 1 output feature
Shared Feature Extractor = True
Orthogonal Initialization = True
Scale Images to [0, 1] = True
Optimizer = Adam with 1e-5 Epsilon

Learning Rate = 2.5e-4
Decay Learning Rate = True

Number of Environments = 8
Number of Steps = 128
Batch Size = 256
Number of Minibatches = 4
Number of Epochs = 4
Gamma = 0.99
GAE Lambda = 0.95
Clip Range = 0.1
VF Clip Range = 0.1
Normalize Advantage = True
Entropy Coefficient = 0.01
VF Coefficient = 0.5
Max Gradient Normalization = 0.5
Use Target KL = False
Total Timesteps = 10000000
Log Interval = 1
Evaluation Episodes = 100
Deterministic Evaluation = False

Seed = Random
Number of Trials = 5

The text was updated successfully, but these errors were encountered:

pseudo-rnd-thoughts · 2024-08-02T08:47:07Z

Wow impressive, but I'm a bit confused what you would like to do.
You note that this is with different seeds, could this explain the difference?
Have you been able to find any implementation details that could cause these issues?

hexonfox · 2024-08-02T09:59:38Z

@pseudo-rnd-thoughts Thank you for the prompt response! I would like to work together and help to determine the cause of these discrepancies, possibly making the implementation by CleanRL more consistent as a result :) I don't think the seeds caused these differences, because the other 50 environments had random seeds as well and they were seen to be statistically consistent as seen in the table below. If seeds were an issue, it probably would have impacted more than just six environments in the table above. I've not been able to find the cause for these inconsistencies yet. Was wondering if you had any suggestions?

sdpkjc · 2024-08-02T10:22:31Z

The ppo implementations of cleanrl and sb3 are indeed inconsistent, and at least one difference I understand is the handling of truncation. sb3 fixes mishandling of environment truncation in openai baselines, while cleanrl keeps this issue. But for atari envs, I'm not sure how big of an impact that is.

See 👇

Enhancing Termination and Truncation Handling in CleanRL's PPO Algorithm #448
Add timeout handling for on-policy algorithms DLR-RM/stable-baselines3#658
[Bug] Infinite horizon tasks are handled like episodic tasks DLR-RM/stable-baselines3#284

hexonfox · 2024-08-02T10:56:20Z

@sdpkjc Thanks for the suggestions! It's surprising that such inconsistencies exist. Will look into it and determine if that is really the cause of the discrepancies.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible inconsistencies with the PPO implementation #477

Possible inconsistencies with the PPO implementation #477

hexonfox commented Aug 2, 2024

pseudo-rnd-thoughts commented Aug 2, 2024

hexonfox commented Aug 2, 2024

sdpkjc commented Aug 2, 2024

hexonfox commented Aug 2, 2024 •

edited

Loading

Possible inconsistencies with the PPO implementation #477

Possible inconsistencies with the PPO implementation #477

Comments

hexonfox commented Aug 2, 2024

Problem Description

Checklist

Current Behavior

Expected Behavior

Possible Solution

Steps to Reproduce

pseudo-rnd-thoughts commented Aug 2, 2024

hexonfox commented Aug 2, 2024

sdpkjc commented Aug 2, 2024

hexonfox commented Aug 2, 2024 • edited Loading

hexonfox commented Aug 2, 2024 •

edited

Loading