session becomes unusable after network timeout #1781

Closed
fansari opened this issue Oct 6, 2022 · 15 comments · Fixed by #1955

Comments

@fansari

fansari commented Oct 6, 2022

I connect to a VM via PuTTY. There I open a zellij session.

If the PuTTY session has a network timeout and loses the connection, the zellij session becomes unusable.

The only workaround so far for me is to close the session before I leave my notebook (the timeout typically happens when I leave for more than just a few minutes). Then I can reconnect via PuTTY and attach zellij again. But when I don't detach from the zellij session everything is messed when I come back.

I think this or something similar was reported here several times but it is still not fixed.

Tested with v0.31.4.

@raphCode
Contributor

raphCode commented Oct 6, 2022

I am sorry for the experience and I understand the frustration of lost sessions.
We had several distinct bugs related to broken ssh connections. After we fixed one, often another one surfaced, so it may look like there is little to no progress overall.

For an immediate workaround you can try replacing ssh with mosh. It keeps the connection alive even after network issues, so you don't need to reconnect. All programs keep running, so zellij won't mess up either.

To better understand the bug, can you give more information? What do you mean by "everything is messed when I come back"?
Is the session gone or does zellij hang upon attaching? Or something else entirely? Maybe a screenshot helps if there is UI corruption.

Also, it would be helpful if you could post the logfiles in /tmp/zellij-*/zellij-log/ after the problem happened.

@fansari
Author

fansari commented Oct 7, 2022

What I have tried today is to set "seconds between keepalives" to 10 in PuTTY in the "Connection" section and set "Enable TCP Keepalives".

Nevertheless the connection to the VM breaks after some time when I am absent from my notebook.

But I have tried again to use "kill -9" on the "zellij attach" process and then reattach.

This has worked two times today. I will have to see whether this will always work or breaks again.

In the past (maybe with older versions) it also worked sometimes but then after a while everything began to "freeze" so I used to restart everything right from the beginning to avoid such freezing issues in the middle of my work.

Another option to avoid these issues I have tried was to run zellij locally within some Alpine container under WSL2 and then SSH from there.

But this gives again very bad copy&paste behaviour (this time not due to zellij but due to this WSL2 shell environment) so I went back to the solution to SSH some Linux VM and run zellij from there.

@fansari
Author

fansari commented Oct 14, 2022

One part of the problem is the way Windows behaves in sleep mode. I have made several tests and found that SSH connections over IPv6 break when you put Windows to sleep.

This is the message I get:

client_loop: send disconnect: Connection reset

I have run tests against different machines over IPv4 and IPv6. SSH breaks only over IPv6. For example, when I log in over IPv4 and IPv6 in parallel, the IPv4 connection is still there after coming back from sleep mode while the IPv6 connection is broken.

It behaves the same in PuTTY and PowerShell.

There is an option in the device manager for the interface ("Power Management") called "Allow the computer to turn off this device to save power". This is enabled by default. I have tried disabling it, but this has no effect.

Of course this is not the fault of zellij. But so far I don't know why Windows behaves this way.

@mpenning

@fansari are you using wifi? If so, check to be sure that your wifi adapter does not go to sleep when you walk away...

@fansari
Author

fansari commented Oct 18, 2022

My notebook is connected via cable to a switch. It is still unclear to me why the Windows sleep mode kills IPv6 while IPv4 survives.

https://learn.microsoft.com/en-us/answers/questions/1049220/why-do-ssh-connections-via-ipv6-break-when-the-sys.html

@raphCode
Contributor

Could it be that TCP connections over IPv6 have a different timeout / close logic?


Please describe what the situation is after the ssh connection breaks:
Is the zellij session killed? Is the zellij server process still running?
Can you reattach? Does the zellij command hang when you try to attach?

@fansari
Author

fansari commented Oct 19, 2022

I must admit I have avoided this because I had bad experiences. Sometimes I can reattach but then later things begin to freeze and if this is in the middle of something important it is not what I like.

Sometimes I see the "zellij attach" process after reconnecting, and sometimes only the server is still running.

When I saw the "zellij attach" process I killed it, because in my experience a new attach hangs immediately if I don't.

When I kill the "zellij attach" process, as far as I remember, this sometimes also kills the server process.

When the server process survived and the attach process got killed, I sometimes reattached and it worked for a while. But as I said, this does not last long, and at the moment you least expect it the whole thing is frozen and you can start all over again.

So for the last weeks whenever I lost connection I directly killed all of zellij.

And since there is no answer from the Microsoft community so far, there are only two options left: either don't work with IPv6 or disable sleep mode. Or a third option, if you have this possibility: don't work with Windows.

@raphCode
Contributor

I am sorry to see that the current zellij behavior is not suitable for doing actual work in your situation.

But the only way of us having a chance at fixing it is to understand the problem. Right now I am not even sure I correctly understand the error situation.

You mention "freezing" and "hanging". My interpretation of these words:

  • "Freezing": everything works correctly, and then suddenly the terminal neither refreshes its output nor reacts to new keypresses
  • "Hanging": when you execute a new zellij and nothing happens, just the cursor jumps to the next line and it waits indefinitely

Does this correspond to your understanding of these terms as well?

Maybe you can start a sacrificial zellij session next to your working session where you can observe the behavior and record logs from /tmp/zellij-*/zellij-log/?

@fansari
Author

fansari commented Oct 26, 2022

With "freezing" you are right. With "hanging" I am not absolutely sure how it behaved. I think it did not open at all.

My current workaround is to disable the sleep mode in Windows. This way my IPv6 connection is not disconnected when I am away from the keyboard for more than a few minutes. There is still no answer in the Microsoft forum as to why this happens.

@raphCode
Contributor

Yeah, "hanging" includes that nothing opens and the only way out is Ctrl-C. I recently found a repro for these cases: #1813 so hopefully we can address that issue soon.
If you want you can try one of the reproducers and report if that yields the same "hanging" behavior as you experience.

Regarding the freezing, at first it seemed to me that this is related to network packets not getting delivered, but after re-reading #1588, #1209 and this issue, I don't think it is that.

@fansari
Author

fansari commented Nov 10, 2022

After I had a network drop I saw both "zellij --server" and "zellij attach" in the process list. I ran "zellij attach" without killing the existing "zellij attach". After this I was able to get my session back. After a few seconds the keyboard did not react anymore.

zellij.log

@raphCode
Contributor

Thanks for the log file! :)
Judging from the times, I think you attached first at 07:29, then had a network drop somewhere in between and re-attached at 09:18, after which everything froze. Can you confirm this?

@imsnif Maybe you are interested in the log, although I can't read anything too useful from it in the described time range. Maybe the previous part of the log gives a hint, I see quite some stuff going on there.

@raphCode
Contributor

raphCode commented Nov 16, 2022

The issue actually was right in front of us all the time :)

Based on my finding today about the bad handling of a killed client (#1949) I suspected that we have more problems with clients misbehaving or not reacting.
@fansari actually mentioned multiple times that after the network problem there is still a zellij attach process lingering around. He also reported that killing it brings the server down, which we totally missed recognizing as a separate issue.
My theory:

  • After the network issue, the controlling terminal the zellij client is running in is not closed (maybe a peculiarity of the ssh setup fansari is using?), so the zellij client never exits
  • the zellij client continues to write repaints and output to its controlling terminal (ssh)
  • ssh does not know where to put the output since the network pipe is broken
  • buffers fill up, eventually zellij client blocks in a write syscall
  • the stalled zellij client also makes the server hang and any future clients attached to the same session
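
The "buffers fill up" step can be illustrated in isolation. The following Python sketch is not zellij code; it only shows that a pipe with no reader accepts a finite number of bytes, after which a blocking writer would stall. The helper name `fill_pipe_until_blocked` is made up for this illustration, and the write end is switched to non-blocking mode so the script reports the limit instead of hanging.

```python
import os


def fill_pipe_until_blocked(chunk_size=4096):
    """Write into a pipe that nobody reads from.

    Returns the number of bytes the kernel accepted before the next
    write would have blocked.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode turns "hang forever" into a BlockingIOError.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, chunk)
            except BlockingIOError:
                # The buffer is full: a blocking writer would freeze here.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    accepted = fill_pipe_until_blocked()
    print(f"pipe accepted {accepted} bytes before a write would block")
```

The buffers between a zellij client, its pty and sshd are not necessarily plain pipes and differ in size, but the principle is the same: a quiet session can keep working until the pending output exceeds whatever buffering is available.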

If there are no applications producing output since the network problem occurred, the behavior is particularly devilish: one can attach a second client and it works for a bit, then everything freezes when the buffers fill up. I can see why this is frustrating enough to kill everything and start over.

With this theory it's quite easy to reproduce:

  • in terminal 1: zellij --session test
  • optional: start application that produces some output, I recommend curl parrot.live to see when it freezes
  • in terminal 2: timeout -s stop 1 zellij attach test

The first zellij session is still useable for some time (or should I say... for some terminal bytes? ;) ) until the write buffers on the suspended client fill up, freezing everything.
Resuming the second client with fg makes everything work again, as would killing it, but that is too risky until #1949 is fixed.
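
The stall-and-resume behavior can also be sketched without zellij. This Python example is an analogy under stated assumptions, not the actual client/server code: a child process that drains its stdin stands in for the suspended client, the parent plays the side that keeps sending it data, and `stall_and_resume` is a hypothetical helper name. It needs a POSIX system, since it uses SIGSTOP and SIGCONT.

```python
import os
import select
import signal
import subprocess
import sys

# Child program: read and discard stdin until EOF.
READER = "import os\nwhile os.read(0, 65536):\n    pass\n"


def stall_and_resume(timeout=5.0):
    """Return (stalled, resumed) for a writer facing a stopped reader."""
    child = subprocess.Popen([sys.executable, "-c", READER],
                             stdin=subprocess.PIPE)
    fd = child.stdin.fileno()
    os.set_blocking(fd, False)  # report a full buffer instead of hanging
    chunk = b"x" * 4096
    stalled = False
    try:
        # Suspend the reader, like `timeout -s stop` does to the client.
        os.kill(child.pid, signal.SIGSTOP)
        for _ in range(100_000):  # upper bound so the loop cannot run forever
            try:
                os.write(fd, chunk)
            except BlockingIOError:
                stalled = True  # buffer full: a blocking writer freezes here
                break
        # Resume the reader, like `fg` does, and wait for room in the pipe.
        os.kill(child.pid, signal.SIGCONT)
        _, writable, _ = select.select([], [fd], [], timeout)
        return stalled, bool(writable)
    finally:
        os.kill(child.pid, signal.SIGCONT)  # never leave the child stopped
        child.stdin.close()                 # EOF lets the child exit
        child.wait()


if __name__ == "__main__":
    print(stall_and_resume())
```

The writer only notices the problem once the buffer is full, which matches the observation that the first session stays usable for a while after the second client is suspended.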

Hey @imsnif, your turn now ;)

@imsnif
Member

imsnif commented Nov 17, 2022

Fantastic analysis @raphCode !!

I'm reproducing this and found the issue. Now mostly trying to figure out the best way to solve it. Will keep this thread posted.

@raphCode
Contributor

@fansari

Until a fix is released, you may try to close the hanging client after a network crash - not by directly killing it, but by terminating the parent ssh process.
Here is a screenshot of a process tree taken from htop:
image

Send the marked process a SIGTERM and it should unblock any other zellij attach sessions :)
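
Locating that parent process can also be scripted. The sketch below assumes Linux (it reads `/proc/<pid>/stat`) and assumes the process to terminate is the nearest ancestor of the stuck `zellij attach` client whose name starts with `sshd`. The screenshot is not reproduced here, so that choice is a guess, and the helper names are hypothetical. Verify the PID in htop or ps before sending any signal.

```python
import os
import signal
import sys


def parse_stat(stat_line):
    """Return (comm, ppid) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    comm = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 2:].split()
    return comm, int(fields[1])  # fields[0] is the state, fields[1] the ppid


def read_stat(pid):
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())


def find_ancestor(pid, name_prefix):
    """Return the nearest ancestor PID whose name starts with name_prefix."""
    while True:
        _, ppid = read_stat(pid)
        if ppid <= 0:
            return None  # reached the top of the process tree
        comm, _ = read_stat(ppid)
        if comm.startswith(name_prefix):
            return ppid
        pid = ppid


if __name__ == "__main__" and len(sys.argv) == 2:
    client_pid = int(sys.argv[1])  # PID of the stuck "zellij attach"
    target = find_ancestor(client_pid, "sshd")  # assumed process name
    if target is None:
        print("no sshd ancestor found")
    else:
        print(f"sending SIGTERM to {target}")
        os.kill(target, signal.SIGTERM)
```

Taking the nearest matching ancestor is deliberate: it should be the per-connection process rather than the main SSH daemon, so other logins are left alone.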
