Question: Is synchronous play (Mode.PLAYER) possible in netgames? #417

Closed
mhe500 opened this issue Nov 5, 2019 · 11 comments

Comments

@mhe500
Contributor

mhe500 commented Nov 5, 2019

Is it possible to play multiplayer games in fully synchronous mode?

The examples imply that Mode.ASYNC_PLAYER is necessary. In my experience the game will lock up and end up eating 100% CPU in d_net.cpp:TryRunTics() (at "wait for new tics if needed"). From the code it appears there was an attempt to make this work ("if(*viz_controlled && !*viz_async && netgame){...").

My question is whether this is possible, and whether anyone has made it work. I've worked through my own code and believe I am stepping/resetting all my environments in lockstep. This does work for some number of tics, but then it freezes.

Thank you!
mhe

@Miffyli
Collaborator

Miffyli commented Nov 6, 2019

See the discussion had in #391 .

Using sync mode should be possible (I have not tried the newest ViZDoom version yet, though), but there are a few quirks mentioned in that issue that can cause deadlocks and the like. IIRC these issues have more to do with the underlying network code of ZDoom than with the ViZDoom API, so fixing them might be challenging (see #228).

@mhe500
Contributor Author

mhe500 commented Nov 6, 2019

Thanks for the pointer. I am using a frame-skip of 4, and I now realize that actually may be putting the doom engine out of sync -- so my environments aren't really in "lock-step" as I'd thought.

I will try the workaround using "update state" and see if I can make it work that way.

Thank you.

@mhe500 mhe500 closed this as completed Nov 6, 2019
@mhe500
Contributor Author

mhe500 commented Nov 6, 2019

Follow-up here: advancing actions by 1 tic in lockstep (round robin) among my players, but only updating state on every frame-skip-th tic (as described by @alex-petrenko), seems to be working well (as opposed to using makeAction with tics > 1).
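
For readers landing here, the scheme described above can be sketched roughly as follows. This is a minimal sketch, not code from the thread: `step_lockstep` is a hypothetical helper, `games` is assumed to hold one already-initialized `DoomGame` per player of the same netgame, and it relies on `set_action` and `advance_action(tics, update_state)` from the ViZDoom Python API.

```python
def step_lockstep(games, actions, frame_skip=4):
    """Advance all players by frame_skip tics, one tic at a time.

    games   -- one game object per player (assumed: vizdoom.DoomGame instances)
    actions -- one action (list of button values) per player
    """
    for tic in range(frame_skip):
        # Only the last tic of the frame-skip window refreshes the state.
        update_state = tic == frame_skip - 1
        # Round robin: every player advances exactly one tic before any
        # player is allowed to advance to the next one.
        for game, action in zip(games, actions):
            game.set_action(action)
            game.advance_action(1, update_state)
```

The point of the inner loop is that no instance ever gets more than one tic ahead of the others, which is what a single call with tics > 1 cannot guarantee.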

Thanks again.

@mhe500
Contributor Author

mhe500 commented Feb 7, 2020

Hi @Miffyli.

I'm reopening this because I'm revisiting multi-player games and am running into what I believe is the same or a very similar problem again.

As discussed above, I ensure that all of my environments step one tic at a time to keep them in sync and that has worked. However, what I've found is that if a multiplayer game sits idle for a minute or so, then the next time I try to step the environment (meaning, step all the environments that comprise that multiplayer game) I end up in a situation where vizdoom again gets stuck in a tight loop in TryRunTics(). This scenario might happen if, say, my trainer pauses to do an evaluation (on a separate environment), then resumes collection on an existing environment.

Question: Is there some internal mechanism that is causing this timeout behavior? I've tried increasing viz_sync_timeout CVAR (to very large values, e.g., 10 minutes) but it has not helped.

Thanks!
mhe

@mhe500 mhe500 reopened this Feb 7, 2020
@Miffyli
Collaborator

Miffyli commented Feb 7, 2020

I am not familiar with the networking side, but if I had to guess I would look for timeouts in ZDoom itself: if players take too long to send their actions, even in a standard multiplayer game, they are probably kicked out due to inactivity, which quite likely messes up the whole ViZDoom multiplayer session. I am not sure whether an old game like (Z)Doom included mechanics like this, but almost any modern video game has them.

@alex-petrenko

alex-petrenko commented Feb 8, 2020

@mhe500 thank you for reporting this. In my setup workers would get stuck during initialization sometimes if I start a lot of environments in parallel. By the time the last environment is initialized, the first ones weren't stepped through for quite a while, and they seem to get stuck. This limits the number of envs I can start in parallel on a big server.

Please hit me up if you find a solution to this. I guess an easy workaround would be to insert a random action here or there to prevent this from happening, but it sounds like an ugly hack.

@mhe500
Contributor Author

mhe500 commented Feb 8, 2020

Will do. Yesterday I read through the networking code in some more detail and started stepping through the stuck instance by attaching GDB and it appears that the stuck environment thinks the other environment is a tic behind where it really is (by viewing the nettics array).

I may try the -extratic flag to duplicate UDP packets in case it's a packet being discarded somewhere.

@mhe500
Contributor Author

mhe500 commented Feb 9, 2020

Ok, so here is what I think is happening:

  1. For whatever reason a tic UDP packet from A->B is lost (still don't know why this is happening).
  2. B needs the packet to proceed so it requests a re-transmit from A (B waiting on A) and polls in a tight loop in TryRunTics()
  3. A has already stepped and is waiting on instructions from my trainer to step again. A is waiting on the VizDoom message queue (i.e., waiting for another step request from the trainer).
  4. The trainer is waiting on environment B's step to complete.

Thus we have a deadlock: Trainer->B->A->Trainer. The lock-stepping has effectively broken Doom's re-transmit mechanism because A cannot process the re-transmit request because it is blocked on the ViZDoom message queue.
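
To make the cycle explicit, the four steps above can be written as a "waits on" table and walked until a node repeats. This is only an illustrative sketch; `find_wait_cycle` and the node names are hypothetical and do not correspond to anything in ViZDoom.

```python
def find_wait_cycle(waits_on, start):
    """Follow 'X waits on Y' edges from start; return the cycle if one exists."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_on.get(node)
    if node is None:
        return None  # the chain ended, so nothing is deadlocked
    # Close the loop by repeating the node where the walk re-entered the path.
    return path[path.index(node):] + [node]


# Trainer waits on B's step, B waits on A's retransmit,
# A waits on the trainer's next step request.
waits_on = {"Trainer": "B", "B": "A", "A": "Trainer"}
cycle = find_wait_cycle(waits_on, "Trainer")
print(" -> ".join(cycle))  # prints "Trainer -> B -> A -> Trainer"
```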

What I propose is a non-blocking queue check (even in synchronous mode) combined with periodic processing of network messages, so that a node can respond to re-transmit requests. This is likely more robust than finding the reason for the missing packet anyway.

The change would be in VIZ_MQTic():

    do {
        if(!*viz_async) {
            // CHANGED CODE
            //VIZ_MQReceive(&msg);
            while(!VIZ_MQTryReceive(&msg)) {
                NetUpdate();
                VIZ_Sleep(1000); // Edit: I realize this was a really idiotic thing to do, this should be using a receive timeout, but you get the point.
            } 
        }
        else if(!VIZ_MQTryReceive(&msg)) 
            break;

Any thoughts or comments about this idea? A cursory test shows it prevents the deadlock, but I'm not 100% certain whether it breaks something else.
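
As the inline edit note says, the intended shape is a receive with a timeout rather than a sleep. A rough Python analogue of that pattern is sketched below. It is not ViZDoom code: it uses the standard-library `queue.Queue` in place of the ViZDoom message queue, `service_network` is a hypothetical stand-in for a call like NetUpdate(), and `max_polls` exists only so the example can terminate.

```python
import queue


def receive_with_service(mq, service_network, timeout_s=0.001, max_polls=None):
    """Wait for a message, servicing the network whenever the wait times out."""
    polls = 0
    while True:
        try:
            # Blocks for at most timeout_s instead of blocking forever.
            return mq.get(timeout=timeout_s)
        except queue.Empty:
            # No step request yet: give the network code a chance to
            # answer pending re-transmit requests, then wait again.
            service_network()
            polls += 1
            if max_polls is not None and polls >= max_polls:
                return None
```

With this structure the blocked side still wakes up at a bounded interval, so it cannot sit on the queue indefinitely while a peer is waiting on it.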

@Miffyli
Collaborator

Miffyli commented Feb 10, 2020

Nice catch! I agree that trying to fix the UDP packet issue would be harder, if only because UDP packets can be dropped by design without any particular reason.

Since this is a busy-loop with functions that might do some other trickery (I am not familiar with NetUpdate and VIZ_Sleep), I figure this could be an optional flag. A "networking compatibility mode", if you will, which could trigger this type of modification with simple if-else structures.

@mhe500
Contributor Author

mhe500 commented Feb 10, 2020

I think I may have spoken too soon. I've spent the better part of the day trying to learn the zdoom/vizdoom timing and network code (it's not exactly straightforward), and there are a few reasons why what I suggested won't work. Namely, from the POV of NetUpdate no time will have elapsed during VIZ_MQTic(), and as such it will refuse to send any new packets without some major changes.

My current experiment is to try forcing TryRunTics to run at most one tic at a time. For some reason that I don't yet understand, the 2 instances are not actually running in full lockstep, even though I'm invoking game.advance_action in lockstep.

I don't want to speak too soon again, so I'll try to test this approach a little better before updating ...

@mhe500
Contributor Author

mhe500 commented Feb 11, 2020

Ok. After 2 days of struggling with this, I'm confident I've figured it out. What I observed using Wireshark is that one packet is indeed lost and that, as described above, zdoom's built-in retransmit mechanism doesn't work when zdoom is driven in this lock-step manner.

Adding some diagnostic code I noticed that sendto was returning EPERM, which indicated the packet was being rejected. The fact that this only happens after an idle time (which @alex-petrenko also noted) led me to the conclusion that this was packet filtering. Indeed, it turns out that it's the setting for Linux's packet filter's UDP inferred 'connection' tracking.

On my system sudo sysctl -a 2>&- | grep nf_conntrack_udp_timeout_stream shows 120 seconds. I increased it to 20 minutes with sudo sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=1200 and the problem went away (I'm pretty sure, but I've been known to speak too soon). I can induce the problem by deleting the connection state before the timeout expires using conntrack -D (you can examine the state with conntrack -L).
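
For anyone who wants to check this from a training script, here is a small sketch that reads the same setting through procfs. It assumes Linux with the conntrack module loaded, where the sysctl above is exposed at the path below; `read_udp_stream_timeout` and `idle_gap_is_risky` are hypothetical helper names.

```python
from pathlib import Path

# procfs view of net.netfilter.nf_conntrack_udp_timeout_stream
CONNTRACK_UDP_STREAM = Path(
    "/proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream"
)


def read_udp_stream_timeout(path=CONNTRACK_UDP_STREAM):
    """Return the timeout in seconds, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file (e.g. conntrack not loaded) or unexpected contents.
        return None


def idle_gap_is_risky(max_idle_s, timeout_s):
    """True if an idle gap could outlive the UDP conntrack entry."""
    return timeout_s is not None and max_idle_s >= timeout_s
```

A trainer could call `idle_gap_is_risky` with its longest expected pause (for example, the duration of an evaluation run) and warn before starting if the pause would exceed the timeout.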

Hope this helps you @alex-petrenko.

3 participants