
Incorrect Results from get_last_reward() when Replaying LMP in Mode.PLAYER #412

Closed · mhe500 opened this issue Oct 20, 2019 · 8 comments
Labels: bug

mhe500 (Contributor) commented Oct 20, 2019

Issue: Playback of a .lmp file using Mode.PLAYER results in game.get_last_reward() sometimes combining the rewards of two time steps.

ViZDoom version: 1.1.7
Platform: Ubuntu 18.04, Python 3.7.3

To reproduce:
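
The original reproduction script is not included above, so the following is a minimal sketch of one, following the pattern of ViZDoom's record_episodes.py example. The config path (scenarios/basic.cfg, whose game variable AMMO2 starts at 50 and whose living reward is -1), the recording file name episode.lmp, and the no-op action are all assumptions:

import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("scenarios/basic.cfg")  # assumed config path
game.set_mode(vzd.Mode.PLAYER)
game.init()

# Record one episode to episode.lmp by doing nothing until timeout.
game.new_episode("episode.lmp")
while not game.is_episode_finished():
    game.make_action([0, 0, 0])  # no-op; basic.cfg defines 3 buttons

# Re-init so the .lmp file is flushed to disk before replaying.
game.close()
game.init()

print("REPLAY OF EPISODE")
print("************************")

# Replay the recorded episode and print the per-step reward.
game.replay_episode("episode.lmp")
while not game.is_episode_finished():
    state = game.get_state()
    game.advance_action()
    print("State #{}".format(state.number))
    print("Game variables:", state.game_variables[0])
    print("Reward:", game.get_last_reward())
    print("=====================")

print("Episode finished.")
print("total reward:", game.get_total_reward())
game.close()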

This implies either a bug in the code, or that the documentation example, which states that replay can be used in any mode, should be updated to indicate that Mode.SPECTATOR is required (and possibly that the code should validate and disallow replaying LMP files in Mode.PLAYER).

Separately, the example also states that game.get_last_action() is not supported during replay, but it is supported now.

Thank you!

Example output when using Mode.PLAYER:

REPLAY OF EPISODE
************************

State #1
Game variables: 50.0
Reward: -1.0
=====================
State #2
Game variables: 50.0
Reward: -2.0
=====================
State #3
Game variables: 50.0
Reward: -2.0
=====================
State #4
Game variables: 50.0
Reward: -2.0
=====================
...
=====================
State #299
Game variables: 50.0
Reward: -1.0
=====================
Episode finished.
total reward: -355.0
************************

Miffyli (Collaborator) commented Oct 21, 2019

Nice catch. Tested on Windows 10 with Python 3.6.8 and ViZDoom 1.1.8pre, and I am getting the same results (it does not work in PLAYER mode but works in SPECTATOR mode). The sleep trick from #354 does not help here, although the two issues might be related.

Miffyli added the bug label Oct 21, 2019

mhe500 (Contributor, Author) commented Oct 21, 2019

This feels like a race condition, and I think my assertion that "it works in Mode.SPECTATOR" is not correct.

If I insert a sleep in the loop and print state.tic in addition to state.number, the results in Mode.SPECTATOR become very far off: the tic increments by more than 1 on each iteration, the loop runs fewer than 300 times, and the rewards are duplicated (see the sketch and example output below).
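
A minimal sketch of the modified replay loop, assuming the reproduction script above; the sleep length and print format are inferred from the output below:

from time import sleep

game.replay_episode("episode.lmp")
while not game.is_episode_finished():
    state = game.get_state()
    game.advance_action()
    sleep(0.1)  # slow the Python loop; Doom's own thread keeps running
    print("State #{} state.tic={}".format(state.number, state.tic))
    print("Game variables:", state.game_variables[0])
    print("Reward:", game.get_last_reward())
    print("=====================")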

Example output with Mode.SPECTATOR after adding sleep(0.1) to the loop and also printing state.tic:

...
=====================
State #80 state.tic=289
Game variables: 50.0
Reward: -5.0
=====================
State #81 state.tic=293
Game variables: 50.0
Reward: -4.0
=====================
State #82 state.tic=296
Game variables: 50.0
Reward: -5.0
=====================
State #83 state.tic=300
Game variables: 50.0
Reward: -4.0
=====================
State #84 state.tic=303
Game variables: 50.0
Reward: -5.0
=====================
State #85 state.tic=307
Game variables: 50.0
Reward: -4.0
=====================
Episode finished.
total reward: -380.0
************************

However, if I do the same in Mode.PLAYER, the result is still wrong but "less wrong": I get about the correct number of iterations (300), and state.tic increments by about 1 each time, though sometimes by +2 and sometimes by 0.

Net net, this feels like a race condition, though the effects are somewhat more limited in Mode.PLAYER; the living reward is still effectively doubled. For training, this means that transitions from an expert trajectory replayed with this method actually look less appealing than those found randomly!

Miffyli (Collaborator) commented Oct 21, 2019

Thanks for pointing this out! I initially found it odd that SPECTATOR mode worked better, since it was designed for human players (and hence is closer to asynchronous than synchronized behavior). I will fix the FAQ according to this discovery :)

mhe500 (Contributor, Author) commented Oct 21, 2019

Well, to be clear: Mode.SPECTATOR works better only when the Python script's loop runs quickly enough -- put in a sleep and the result is worse than with Mode.PLAYER. In all cases I think there is a race condition, and neither SPECTATOR nor PLAYER produces the right result.

I've experimented a bit, and if I change ViZDoomGame.cpp as follows:

diff --git a/src/lib/ViZDoomGame.cpp b/src/lib/ViZDoomGame.cpp
index 54d1764..cae7529 100644
--- a/src/lib/ViZDoomGame.cpp
+++ b/src/lib/ViZDoomGame.cpp
@@ -215,7 +215,10 @@ namespace vizdoom {
         this->summaryReward += reward;
         this->lastReward = reward;
 
-        this->lastMapTic = this->doomController->getMapTic();
+        if (this->doomController->isRunDoomAsync())
+            this->lastMapTic = this->doomController->getMapTic();
+        else
+            this->lastMapTic = this->doomController->getMapLastTic();
 
         /* Update state */
         if (!this->isEpisodeFinished()) {

My belief is that, in the synchronous cases, it is more correct for the DoomGame to take the last map tic executed by its user from the value the DoomController holds, rather than from DoomController::getMapTic, which queries shared memory that may have been updated by Doom running in a separate thread.

Testing human gameplay (SPECTATOR), .lmp playback (PLAYER), and agent play (PLAYER), I get the results I expect. However, since I don't really know the internals of ViZDoom, I'm not sure whether this is a reasonable fix, but for now I will use it as a workaround.

Miffyli (Collaborator) commented Oct 21, 2019

In the words of @mwydmuch, the whole project is a hack on top of ZDoom, so solutions tend not to be too pretty. If this solution works reliably for you, it would be quite dandy if you could make a PR out of it :)

mhe500 (Contributor, Author) commented Oct 21, 2019

Let me play with it a bit and if it works I will gladly submit a PR. Thanks!

mhe500 (Contributor, Author) commented Oct 21, 2019

One more update:

I think the change below addresses an additional problem: the missing waitForDoomWork() causes the controller and ViZDoom to become out of sync, and subsequent TIC/UPDATE/TIC+UPDATE requests from the controller receive an acknowledgement that should have been paired with the prior message.

The two changes I made seem to be working well for me but need additional testing.

diff --git a/src/lib/ViZDoomController.cpp b/src/lib/ViZDoomController.cpp
index daf6bba..faafbc8 100644
--- a/src/lib/ViZDoomController.cpp
+++ b/src/lib/ViZDoomController.cpp
@@ -441,6 +441,7 @@ namespace vizdoom {
             // Workaround for some problems
             this->sendCommand(std::string("map ") + this->map);
             this->MQDoom->send(MSG_CODE_TIC);
+            this->waitForDoomWork();
 
             this->sendCommand(std::string("playdemo ") + prepareLmpFilePath(demoPath));

mhe500 added a commit to multitude-ai/ViZDoom that referenced this issue on Oct 23, 2019: "Else DoomController becomes out of sync with Doom. See: Farama-Foundation#412"

mhe500 added a commit to multitude-ai/ViZDoom that referenced this issue on Oct 23, 2019

mwydmuch (Member) commented Nov 3, 2019

Fixed by #415

mwydmuch closed this as completed Nov 3, 2019