
fix(router): handle client buffer overflow #1955

Merged: 2 commits into main on Nov 18, 2022

Conversation

@imsnif (Member) commented Nov 17, 2022

Fixes #1781 (session becomes unusable after network timeout)

What was happening

As per @raphCode's excellent analysis in the issue comments: the SSH session seems not to close the controlling terminal, so once its buffer fills up, the client thread blocks while printing to STDOUT, and in turn the entire router thread on the Zellij server blocks when trying to send messages to that client.

The fix

Instead of sending messages directly to the client from the server, we now employ a separate thread to send messages to each client. This thread has its own buffer (50 messages), and once that buffer is filled, we stop sending messages to this particular client (essentially "forgetting" about it). If the client ever picks up its end of the IPC socket again, we send it one last "Buffer full" message before telling it to disconnect, so it knows what happened.
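As a rough illustration of the pattern (a minimal sketch, not the actual Zellij code; `ServerToClientMsg`, `spawn_client_sender`, and `send_to_client` are hypothetical names), the idea is a bounded channel per client, drained by a dedicated sender thread, so the router itself never blocks:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;

// Placeholder for whatever the server sends to a client.
#[derive(Debug)]
enum ServerToClientMsg {
    Render(String),
    Exit(String),
}

// Spawn one sender thread per client with a bounded buffer of 50 messages.
// `write_blocking` stands in for the real (blocking) IPC write.
fn spawn_client_sender(
    mut write_blocking: impl FnMut(ServerToClientMsg) + Send + 'static,
) -> SyncSender<ServerToClientMsg> {
    let (tx, rx) = sync_channel::<ServerToClientMsg>(50);
    thread::spawn(move || {
        // Drains the buffer; if the client stalls, only this thread blocks.
        for msg in rx {
            write_blocking(msg);
        }
    });
    tx
}

// Called from the router: never blocks. Returns false once the client's
// buffer is full (or its sender thread is gone).
fn send_to_client(tx: &SyncSender<ServerToClientMsg>, msg: ServerToClientMsg) -> bool {
    match tx.try_send(msg) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}
```

When `send_to_client` returns `false`, the router can stop sending to that client and, if the client later drains the socket, deliver a final "Buffer full" notice before disconnecting it.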

Other considerations

I initially wanted to fix this by making sending messages to the client non-blocking with a buffer (to avoid the indirection of doing this in a separate thread), but the only way to do that would be to make the entire IPC channel non-blocking. And this would require quite some infrastructure change on our part (probably writing some event loop or using a shiny async framework to manage it).
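For contrast, here is a hypothetical sketch (not from this PR) of what a non-blocking write on the IPC socket itself would look like; once the stream is non-blocking, every call site has to handle `WouldBlock` and partial writes, which is exactly the event-loop-shaped infrastructure referred to above:

```rust
use std::io::{self, Write};
use std::os::unix::net::UnixStream;

// Hypothetical illustration only. With a non-blocking stream, a write can
// return WouldBlock or complete partially, so every caller needs to queue
// the unsent bytes and retry later (i.e. an event loop or async runtime)
// instead of simply calling write_all().
fn try_write(stream: &mut UnixStream, bytes: &[u8]) -> io::Result<usize> {
    stream.set_nonblocking(true)?;
    match stream.write(bytes) {
        // `n` may be smaller than bytes.len(); the caller must track the remainder.
        Ok(n) => Ok(n),
        // Nothing could be written right now; try again on a later iteration.
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(0),
        Err(e) => Err(e),
    }
}
```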

@raphCode (Contributor) commented:
Oh no, more threads :D
(just fearing deadlocks in the future, also minor performance penalties. But I don't really have big objections if it works)

I originally thought we could just drop / disconnect clients that have not responded for a while? Or when the server tries to send the client a message and it would block. Otherwise stalled clients may pile up in some scenarios?


Ideally, we redesign the whole server-client communication loop to make it absolutely fail-proof. Not sure if that is feasible, but discovering all the various ways the communication may fail through weird bugs and server crashes seems a bit bad. As far as I understand it, Rust's data model helps to make it obvious where things can fail by providing Results so any potential errors become apparent while programming.
(Except the blocking issue here which is not reflected in a return / data type - but the manual mentions blocking)

> but the only way to do that would be to make the entire IPC channel non-blocking. And this would require quite some infrastructure change on our part (probably writing some event loop or using a shiny async framework to manage it).

I still think that making parts of our infrastructure async would be nice, especially as most threads have to take locks iiuc, thus not providing any parallelism, but just using the kernel as a costly async-environment.
But I also understand the cost of rewrite may be too high, so fixing this bug and moving on is also a legit move :)

@imsnif (Member, Author) commented Nov 18, 2022

> Oh no, more threads :D (just fearing deadlocks in the future, also minor performance penalties. But I don't really have big objections if it works)

That data isn't even cloned, it's moved entirely, so it's just moving it through another thread. If there's a performance penalty at all, I was not able to detect it when benchmarking this.

About deadlocks - while I understand your concern, unless this is heavily abused in the future by other developers (and you know, they could just as easily put this sort of infrastructure in themselves), there is no chance for deadlocks here. This is essentially just a wrapper around a blocking API.

> I originally thought we could just drop / disconnect clients that have not responded for a while? Or when the server tries to send the client a message and it would block. Otherwise stalled clients may pile up in some scenarios?

This is what we're doing. The only thing we're holding on to is our end of the IPC pipe. The only way (that I know of) to avoid that would be for the entire pipe to become non-blocking, which leads to the issues I talked about above.

> Ideally, we redesign the whole server-client communication loop to make it absolutely fail-proof. Not sure if that is feasible, but discovering all the various ways the communication may fail through weird bugs and server crashes seems a bit bad. As far as I understand it, Rust's data model helps to make it obvious where things can fail by providing Results so any potential errors become apparent while programming. (Except the blocking issue here which is not reflected in a return / data type - but the manual mentions blocking)

Ideally we'd have a team of 10 people working full time on the project and could devote one person to spend all of their time for a month doing just this. Yes. Alas, we live in an imperfect world :)

I don't think this particular part of Rust can help us here. The problems aren't crashes/panics; they are exactly these spots where things block, either because of an API or because of a fault in our internal logic. Results will not help us here, I'm afraid.

> I still think that making parts of our infrastructure async would be nice, especially as most threads have to take locks iiuc, thus not providing any parallelism, but just using the kernel as a costly async-environment. But I also understand the cost of rewrite may be too high, so fixing this bug and moving on is also a legit move :)

We get some pretty nice results doing things as we do them now. We have lots of happy users, and since we're in beta, a lot of our current users are happy to lend a hand troubleshooting issues. It's the process most major OSS projects go through, and I dare say it makes for more stable software in the end.

@imsnif (Member, Author) commented Nov 18, 2022

Thanks again for all your work on this, @raphCode!

imsnif merged commit 2afb355 into main on Nov 18, 2022
@raphCode (Contributor) commented:
I played around with many zellij clients and an overloaded system to see if clients disconnect because they read the buffer too slowly, but it seems very hard to trigger. Should probably be fine.

> That data isn't even cloned, it's moved entirely, so it's just moving it through another thread.

I am more concerned about kernel scheduling / context switching times - threads are not zero-cost at the OS level.
Maybe this is too academic a perspective; in my mind threads are most sensibly used for parallelizing workloads, so the scheduling overhead pays off through overall performance gains.
