
Reproducing the paper result of doom_deadly_corridor in VizDoom #296

Open
sjaelee25 opened this issue Mar 10, 2024 · 5 comments

@sjaelee25

I have tried to reproduce the experimental result for the Deadly Corridor scenario from the VizDoom basic tasks.

The command follows the documentation: https://www.samplefactory.dev/09-environment-integrations/vizdoom/#reproducing-paper-results
python -m sf_examples.vizdoom.train_vizdoom --train_for_env_steps=500000000 --algo=APPO --env=doom_deadly_corridor --env_frameskip=4 --use_rnn=True --num_workers=36 --num_envs_per_worker=8 --num_policies=1 --batch_size=2048 --wide_aspect_ratio=False --experiment=doom_basic_envs

However, the performance looks like this:
[screenshot: training reward curve for doom_deadly_corridor]
while the paper result is as follows:
[screenshot: corresponding reward curve from the paper]

Could you check this issue? Thanks!

@alex-petrenko
Owner

Hi @sjaelee25

I think you might be dealing with two separate issues here.

  1. The reward scale reported in the paper most likely matches the baseline "A2C" paper. I'm guessing you're training with a reward scale of 0.1, so you're seeing very different numbers because WandB logs metrics as observed by the learning algorithm, i.e. after reward scaling, not the original env rewards (see the sketch after this list). Check your configs for reward scaling; I'm not sure where it's applied, the last time I looked was ~3 years ago.

  2. As far as I remember, "Deadly Corridor" is a hard-exploration environment, and standard RL algorithms like PPO don't do very well on it. It's not a good environment for measuring the relative performance of different RL algorithms because results have a lot of variance and are very sensitive to initial conditions.
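
To illustrate point 1, here is a minimal sketch (not Sample Factory's actual code; the wrapper and its default scale are assumptions) of the kind of reward-scaling wrapper that makes logged returns differ from the raw env rewards:

```python
# Hedged sketch, not Sample Factory's actual wrapper: something along these
# lines is typically how reward scaling is applied before the learner sees it.
import gymnasium as gym


class ScaleReward(gym.RewardWrapper):
    """Multiplies every env reward by a constant factor."""

    def __init__(self, env, scale=0.01):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # The learner (and therefore the episodic return logged to WandB)
        # only ever sees the scaled reward, so logged curves are off from
        # the raw env rewards by a factor of 1 / scale.
        return reward * self.scale
```

In other words, if a wrapper like this with scale=0.01 sits between the env and the learner, every return curve in WandB will be 100x smaller than the numbers in the paper even when the underlying behavior is identical.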

A good test of whether your agent can learn anything better is to look at reward_max. It looks like, under the initial random policy, the agent almost never sees rewards higher than what it eventually converges to, and RL can't learn a behavior it never gets to observe.
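
If you want to sanity-check this outside of training, a rough diagnostic like the following (using the raw ViZDoom API; the scenario config path and episode count are assumptions) estimates the best return a purely random policy ever stumbles into:

```python
# Rough diagnostic, assuming a local ViZDoom install: roll out a random policy
# in deadly_corridor and record the best episode return it ever reaches.
import itertools
import random

import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("scenarios/deadly_corridor.cfg")  # assumed path to the scenario config
game.set_window_visible(False)
game.init()

# Enumerate all on/off combinations of the available buttons as discrete actions.
n_buttons = game.get_available_buttons_size()
actions = [list(a) for a in itertools.product([0, 1], repeat=n_buttons)]

best_return = float("-inf")
for _ in range(100):  # number of random episodes is arbitrary
    game.new_episode()
    while not game.is_episode_finished():
        game.make_action(random.choice(actions), 4)  # frameskip=4, as in training
    best_return = max(best_return, game.get_total_reward())

print(f"Best random-policy return over 100 episodes: {best_return:.1f}")
game.close()
```

If the best random-policy return is far below what the paper reports, pure PPO has very little signal to bootstrap from.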

Why there's a discrepancy between the current version of the codebase and the 2020 version from when this experiment was done, I can't say; a thousand different things have changed since then. But I remember this scenario always being very high-variance.
If you want a simple test, use doom_basic or defend_the_center.
If you want a challenging environment where you can compare different ideas, use battle or battle2.
If you want this one to reliably produce good results, you'll probably have to spend some time on it specifically and add some reward shaping or an exploration heuristic (a rough sketch of the former is below).
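
For example, a minimal shaping wrapper might reward forward progress along the corridor. This is only a sketch, not code from this repo: it assumes the wrapped env exposes the underlying DoomGame as `env.game` (that attribute name is hypothetical), and the bonus coefficient is arbitrary.

```python
# Illustrative shaping wrapper, not code from this repo: add a small bonus for
# forward progress along the corridor. Assumes the wrapped env exposes the
# underlying DoomGame as `env.game` (attribute name is hypothetical).
import gymnasium as gym
import vizdoom as vzd


class CorridorProgressShaping(gym.Wrapper):
    def __init__(self, env, bonus_per_unit=0.01):
        super().__init__(env)
        self.bonus_per_unit = bonus_per_unit
        self.prev_x = None

    def _agent_x(self):
        return self.env.game.get_game_variable(vzd.GameVariable.POSITION_X)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.prev_x = self._agent_x()
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        x = self._agent_x()
        # Dense bonus for moving toward the end of the corridor; note this
        # changes the optimization objective, so report unshaped returns too.
        reward += self.bonus_per_unit * (x - self.prev_x)
        self.prev_x = x
        return obs, reward, terminated, truncated, info
```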

@alex-petrenko
Owner

Also, I recommend playing the scenario by hand to see what kinds of rewards are achievable with human-level play. That will give you a better idea of what's possible and what the agent needs to do.

@sjaelee25
Author

sjaelee25 commented Mar 12, 2024

Thank you for your reply and advice!

As you mentioned, the reward scale for deadly_corridor is set to 0.01, while the other VizDoom basic tasks use a scale of 1.0. However, changing the scale to 1.0 or 0.1 does not seem to affect the results significantly, and the variance is also very low across multiple random seeds.

While I am also considering other tasks, I would like to demonstrate the advantages of my proposed method on deadly_corridor. I apologize for asking about code developed over 3 years ago, but I am also curious about reward shaping (as opposed to scaling).
For tasks such as health gathering and battle, there are separate reward shaping functions. I wonder whether deadly_corridor also had an additional shaping reward that was later removed, or some other case where the training reward != the episode return.

Thank you again!

@alex-petrenko
Owner

I don’t think there was any special reward shaping function. You can check the icml2020 release of the codebase to make sure.

This is an exploration task and it’s just very sensitive to initial conditions. Most likely your agent hits an early local minimum and is unable to improve.

@sjaelee25
Author

Okay, thank you for your reply!
