
fix(router): handle client buffer overflow #1955

Merged: 2 commits into main on Nov 18, 2022

Conversation

@imsnif (Member) commented Nov 17, 2022

Fixes #1781 (session becomes unusable after network timeout)

What was happening

As per @raphCode's excellent analysis in the issue comments: the SSH session seems not to close the controlling terminal, so once its buffer fills up, the client thread blocks while printing to STDOUT, and in turn the entire router thread on the Zellij server blocks when trying to send messages to that client.

The fix

Instead of sending messages directly to the client from the server, we now employ a separate thread to send messages to each client. This thread has its own buffer (50 messages), and once that buffer is filled, we stop sending messages to this particular client (essentially "forgetting" about it). If the client ever picks up its end of the IPC socket again, we send it one last "Buffer full" message before telling it to disconnect, so it knows what happened.
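As a rough illustration of the pattern (a minimal sketch, not the actual Zellij code; `ServerToClientMsg`, `spawn_client_sender`, and `send_to_client` are hypothetical names), the idea is a bounded channel per client, drained by a dedicated sender thread, so the router itself never blocks:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;

// Placeholder for whatever the server sends to a client.
#[derive(Debug)]
enum ServerToClientMsg {
    Render(String),
    Exit(String),
}

// Spawn one sender thread per client with a bounded buffer of 50 messages.
// `write_blocking` stands in for the real (blocking) IPC write.
fn spawn_client_sender(
    mut write_blocking: impl FnMut(ServerToClientMsg) + Send + 'static,
) -> SyncSender<ServerToClientMsg> {
    let (tx, rx) = sync_channel::<ServerToClientMsg>(50);
    thread::spawn(move || {
        // Drains the buffer; if the client stalls, only this thread blocks.
        for msg in rx {
            write_blocking(msg);
        }
    });
    tx
}

// Called from the router: never blocks. Returns false once the client's
// buffer is full (or its sender thread is gone).
fn send_to_client(tx: &SyncSender<ServerToClientMsg>, msg: ServerToClientMsg) -> bool {
    match tx.try_send(msg) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}
```

When `send_to_client` returns `false`, the router can stop sending to that client and, if the client later drains the socket, deliver a final "Buffer full" notice before disconnecting it.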

Other considerations

I initially wanted to fix this by making sending messages to the client non-blocking with a buffer (to avoid the indirection of doing this in a separate thread), but the only way to do that would be to make the entire IPC channel non-blocking. And this would require quite some infrastructure change on our part (probably writing some event loop or using a shiny async framework to manage it).
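For contrast, here is a hypothetical sketch (not from this PR) of what a non-blocking write on the IPC socket itself would look like; once the stream is non-blocking, every call site has to handle `WouldBlock` and partial writes, which is exactly the event-loop-shaped infrastructure referred to above:

```rust
use std::io::{self, Write};
use std::os::unix::net::UnixStream;

// Hypothetical illustration only. With a non-blocking stream, a write can
// return WouldBlock or complete partially, so every caller needs to queue
// the unsent bytes and retry later (i.e. an event loop or async runtime)
// instead of simply calling write_all().
fn try_write(stream: &mut UnixStream, bytes: &[u8]) -> io::Result<usize> {
    stream.set_nonblocking(true)?;
    match stream.write(bytes) {
        // `n` may be smaller than bytes.len(); the caller must track the remainder.
        Ok(n) => Ok(n),
        // Nothing could be written right now; try again on a later iteration.
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(0),
        Err(e) => Err(e),
    }
}
```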

@raphCode (Contributor) commented:
Oh no, more threads :D
(just fearing deadlocks in the future, also minor performance penalties. But I don't really have big objections if it works)

I originally thought we could just drop / disconnect clients that have not responded for a while? Or when the server tries to send the client a message and it would block. Otherwise stalled clients may pile up in some scenarios?


Ideally, we redesign the whole server-client communication loop to make it absolutely fail-proof. Not sure if that is feasible, but discovering all the various ways the communication may fail through weird bugs and server crashes seems a bit bad. As far as I understand it, Rust's data model helps to make it obvious where things can fail by providing Results so any potential errors become apparent while programming.
(Except the blocking issue here which is not reflected in a return / data type - but the manual mentions blocking)

> but the only way to do that would be to make the entire IPC channel non-blocking. And this would require quite some infrastructure change on our part (probably writing some event loop or using a shiny async framework to manage it).

I still think that making parts of our infrastructure async would be nice, especially as most threads have to take locks iiuc, thus not providing any parallelism, but just using the kernel as a costly async-environment.
But I also understand the cost of rewrite may be too high, so fixing this bug and moving on is also a legit move :)

@imsnif (Member, Author) commented Nov 18, 2022

> Oh no, more threads :D (just fearing deadlocks in the future, also minor performance penalties. But I don't really have big objections if it works)

That data isn't even cloned, it's moved entirely, so it's just moving it through another thread. If there's a performance penalty at all, I was not able to detect it when benchmarking this.

About deadlocks - while I understand your concern, unless this is heavily abused in the future by other developers (and you know, they could just as easily put this sort of infrastructure in themselves), there is no chance for deadlocks here. This is essentially just a wrapper around a blocking API.

> I originally thought we could just drop / disconnect clients that have not responded for a while? Or when the server tries to send the client a message and it would block. Otherwise stalled clients may pile up in some scenarios?

This is what we're doing. The only thing we're holding on to is our end of the IPC pipe. The only way (that I know of) to avoid that would be for the entire pipe to become non-blocking, which leads to the issues I talked about above.

> Ideally, we redesign the whole server-client communication loop to make it absolutely fail-proof. Not sure if that is feasible, but discovering all the various ways the communication may fail through weird bugs and server crashes seems a bit bad. As far as I understand it, Rust's data model helps to make it obvious where things can fail by providing Results so any potential errors become apparent while programming. (Except the blocking issue here which is not reflected in a return / data type - but the manual mentions blocking)

Ideally we'd have a team of 10 people working full time on the project and could devote one person to spend all of their time for a month doing just this. Yes. Alas, we live in an imperfect world :)

I don't think this particular part of Rust can help us here. The problems aren't crashes/panics; they are exactly these spots where things block, either because of an API or because of a fault in our internal logic. Results will not help us here, I'm afraid.

> I still think that making parts of our infrastructure async would be nice, especially as most threads have to take locks iiuc, thus not providing any parallelism, but just using the kernel as a costly async-environment. But I also understand the cost of rewrite may be too high, so fixing this bug and moving on is also a legit move :)

We get some pretty nice results doing things as we do them now. We have lots of happy users, and since we're in beta, a lot of our current users are happy to lend a hand troubleshooting issues. It's the process most major OSS projects go through, and I dare say it makes for more stable software in the end.

@imsnif (Member, Author) commented Nov 18, 2022

Thanks again for all your work on this, @raphCode!

imsnif merged commit 2afb355 into main on Nov 18, 2022
@raphCode (Contributor) commented:
I played around with many zellij clients and an overloaded system to see if clients disconnect because they read the buffer too slowly, but it seems very hard to trigger. Should probably be fine.

> That data isn't even cloned, it's moved entirely, so it's just moving it through another thread.

I am more concerned about kernel scheduling / context switching times - threads are not zero-cost at the OS level.
Maybe this is too academic a perspective; in my mind threads are most sensibly used for parallelizing workloads, so the scheduling overhead pays off through overall performance gains.
