Question: Is synchronous play (Mode.PLAYER) possible in netgames? #417
Comments
See the discussion in #391. Using sync mode should be possible (I have not tried the newest ViZDoom version yet, though), but there are a few quirks mentioned in that issue that can cause deadlocks and similar problems. IIRC these issues have more to do with the underlying network code of ZDoom than with the ViZDoom API, so fixing them might be challenging (see #228).
Thanks for the pointer. I am using a frame-skip of 4, and I now realize that this may actually be putting the doom engine out of sync -- so my environments aren't really in "lock-step" as I'd thought. I will try the workaround using "update state" and see if I can make it work that way. Thank you.
Followup here: advancing actions by 1 tic in lockstep (round robin) among my players, but only updating state on every frame-skip-th tic (as described by @alex-petrenko), seems to be working well (as opposed to using makeAction with tics > 1). Thanks again.
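The workaround described in the comment above can be sketched roughly as follows. This is a minimal sketch, not necessarily the code used in this thread: it assumes each player object exposes ViZDoom's `set_action` and `advance_action(tics, update_state)` methods, and the helper name `step_lockstep` and its `frame_skip` parameter are hypothetical.

```python
from typing import Sequence


def step_lockstep(games: Sequence, actions: Sequence[list], frame_skip: int = 4) -> None:
    """Advance all players of one netgame by `frame_skip` tics, one tic at a time.

    Assumes every object in `games` offers ViZDoom-style `set_action(action)`
    and `advance_action(tics, update_state)` methods. `step_lockstep` is a
    hypothetical helper, not part of the ViZDoom API.
    """
    if len(games) != len(actions):
        raise ValueError("need exactly one action per player")
    for tic in range(frame_skip):
        # Refresh the state (buffers, game variables) only on the last tic.
        update_state = tic == frame_skip - 1
        # Round robin: every player advances exactly one tic before any
        # player moves on to the next one.
        for game, action in zip(games, actions):
            game.set_action(action)
            game.advance_action(1, update_state)
```

In a real setup each player usually lives in its own process or thread, so the inner loop may need to dispatch the call to whichever worker owns that instance; the sketch only shows the ordering of the calls.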
Hi @Miffyli. I'm reopening this because I'm revisiting multi-player games and am running into what I believe is the same or a very similar problem again. As discussed above, I ensure that all of my environments step one tic at a time to keep them in sync and that has worked. However, what I've found is that if a multiplayer game sits idle for a minute or so, then the next time I try to step the environment (meaning, step all the environments that comprise that multiplayer game) I end up in a situation where vizdoom again gets stuck in a tight loop in TryRunTics(). This scenario might happen if, say, my trainer pauses to do an evaluation (on a separate environment), then resumes collection on an existing environment. Question: Is there some internal mechanism that is causing this timeout behavior? I've tried increasing the viz_sync_timeout CVAR (to very large values, e.g., 10 minutes) but it has not helped. Thanks!
I am not familiar with the networking side, but if I had to guess I would look for timeouts in zdoom itself: if players take too long to send their actions, even in a standard multiplayer game, they are probably kicked out due to inactivity, which quite likely messes up the whole vizdoom multiplayer session. I am not sure whether an old game like (z)doom includes mechanics like this, but almost every modern video game does.
@mhe500 thank you for reporting this. In my setup workers would sometimes get stuck during initialization if I start a lot of environments in parallel. By the time the last environment is initialized, the first ones haven't been stepped for quite a while, and they seem to get stuck. This limits the number of envs I can start in parallel on a big server. Please hit me up if you find a solution to this. I guess an easy workaround would be to insert a random action here or there to prevent this from happening, but it sounds like an ugly hack.
Will do. Yesterday I read through the networking code in some more detail and started stepping through the stuck instance by attaching GDB. Judging by the nettics array, the stuck environment thinks the other environment is a tic behind where it really is. I may try the -extratic flag to duplicate UDP packets in case it's a packet being discarded somewhere.
Ok, so here is what I think is happening:
Thus we have a deadlock: Trainer->B->A->Trainer. The lock-stepping has effectively broken Doom's re-transmit mechanism: A cannot process the re-transmit request because it is blocked on the ViZDoom message queue. What I propose is a non-blocking queue check (even in synchronous mode) combined with periodically processing network messages, so that a node can respond to re-transmit requests. This is likely more robust than finding the reason for the missing packet anyway. The change would be in
do {
    if(!*viz_async) {
        // CHANGED CODE
        //VIZ_MQReceive(&msg);
        while(!VIZ_MQTryReceive(&msg)) {
            NetUpdate();
            VIZ_Sleep(1000); // Edit: I realize this was a really idiotic thing to do, this should be using a receive timeout, but you get the point.
        }
    }
    else if(!VIZ_MQTryReceive(&msg))
        break;
Any thoughts or comments about this idea? A cursory test shows it prevents the deadlock, but I'm not 100% certain whether it breaks something else.
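The receive-timeout variant mentioned in the code comment above could look roughly like this. It is shown in Python rather than the engine's C++ purely for illustration: `receive_with_servicing`, `mq` and `service_network` are hypothetical names standing in for the ViZDoom message-queue receive and `NetUpdate()`. The sketch only shows the waiting pattern and says nothing about whether the engine would actually send packets in that situation.

```python
import queue
from typing import Any, Callable


def receive_with_servicing(
    mq: "queue.Queue[Any]",
    service_network: Callable[[], None],
    timeout_s: float = 0.001,
) -> Any:
    """Wait for the next message, servicing the network between waits.

    Each wait blocks for at most `timeout_s` seconds, so the caller neither
    busy-loops nor stays blocked for so long that it cannot answer a
    re-transmit request from another node.
    """
    while True:
        try:
            # Blocks until a message arrives or the timeout expires.
            return mq.get(timeout=timeout_s)
        except queue.Empty:
            # Nothing from the controller yet: give the network layer a turn.
            service_network()
```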
Nice catch! I agree that trying to fix the UDP packet issue would be harder, if only because UDP packets can, by design, disappear without any reason. Being a busy-loop with functions that might do some other trickery (I am not familiar with
I think I may have spoken too soon. I've spent the better part of the day trying to learn the zdoom/vizdoom timing and network code (it's not exactly straightforward) and there are a few reasons why what I suggested won't work (namely because, from the POV of NetUpdate, no time will have elapsed during VIZ_MQTic(), and as such it will refuse to send any new packets without some major changes). My current experiment is to try forcing TryRunTics to run at most one tic at a time. For some reason that I don't yet understand, the 2 instances are not actually running in full lockstep, even though I'm invoking game.advance_action in lockstep. I don't want to speak too soon again, so I'll try to test this approach a little better before updating ...
Ok. After 2 days of struggling with this, I'm confident I've figured it out. What I observed using Wireshark is that one packet is indeed lost and that (as described above), in the lock-step manner in which we are driving zdoom, zdoom's built-in retransmit mechanism doesn't work. Adding some diagnostic code I noticed that On my system Hope this helps you @alex-petrenko.
Is it possible to play multiplayer games in fully synchronous mode?
The examples imply that Mode.ASYNC_PLAYER is necessary. In my experience the game will lock up and end up eating 100% CPU in d_net.cpp:TryRunTics() (at "wait for new tics if needed"). From the code it appears there was an attempt to make this work ("if(*viz_controlled && !*viz_async && netgame){...").
My question is whether or not this is possible or anyone has made it work? I've worked through my own code and believe I am stepping/resetting all my environments in lockstep and this does work for some number of tics, but then freezes.
Thank you!
mhe