fix(router): handle client buffer overflow #1955
Conversation
Oh no, more threads :D I originally thought that we could just drop / disconnect clients that have not responded for a while, or when the server tries to send the client a message and the send would block? Ideally, we would redesign the whole server-client communication loop to make it absolutely fail-proof. Not sure if that is feasible, but discovering all the various ways the communication may fail through weird bugs and server crashes seems a bit bad. As far as I understand it, Rust's data model helps to make it obvious where things can fail by providing
I still think that making parts of our infrastructure async would be nice, especially as most threads have to take locks IIUC, thus not providing any parallelism but just using the kernel as a costly async environment.
That data isn't even cloned; it's moved entirely, so we're just passing it through another thread. If there's a performance penalty at all, I wasn't able to detect it when benchmarking this. As for deadlocks: while I understand your concern, unless this is heavily abused in the future by other developers (who could just as easily build this infrastructure themselves), there is no real chance for deadlocks here. This is essentially just a wrapper around a blocking API.
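To illustrate the "moved, not cloned" point, here is a minimal sketch (not the actual zellij code; `Payload` is a made-up type): the payload doesn't even implement `Clone`, yet it can be handed to a forwarding thread, because `mpsc`'s `send` takes ownership of the value.

```rust
use std::sync::mpsc;
use std::thread;

struct Payload(Vec<u8>); // note: no #[derive(Clone)]

fn main() {
    let (tx, rx) = mpsc::channel::<Payload>();
    let forwarder = thread::spawn(move || {
        for msg in rx {
            // In zellij this is roughly where the blocking write to the
            // client's IPC pipe would happen, off the router thread.
            println!("forwarded {} bytes", msg.0.len());
        }
    });
    tx.send(Payload(vec![0u8; 1024])).unwrap(); // ownership moves into the channel
    drop(tx); // close the channel so the forwarder's loop ends
    forwarder.join().unwrap();
}
```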
This is what we're doing. The only thing we're holding on to is our end of the IPC pipe. The only way (that I know of) to avoid that would be to make the entire pipe non-blocking, which leads to the issues I talked about above.
Ideally we'd have a team of 10 people working full time on the project and could devote one person to spend a month doing exactly this. Alas, we live in an imperfect world :) I don't think this particular part of Rust can help us here. The problems aren't crashes/panics; they are exactly these places where things block, either because of an API or because of a fault in our internal logic.
We get some pretty nice results doing things the way we do them now. We have lots of happy users, and since we're in beta, many of our current users are happy to lend a hand troubleshooting issues. It's the process most major OSS projects go through, and I dare say it makes for more stable software in the end.
Thanks again for all your work on this @raphCode!
I played around with many zellij clients and an overloaded system to see if clients disconnect because they read the buffer too slowly, but it seems very hard to trigger. It should probably be fine.
I am more concerned about kernel scheduling / context-switching times - threads are not zero-cost at the OS level.
Fixes #1781
What was happening
As per @raphCode's excellent analysis in the issue comments: since the SSH session seems not to close the controlling terminal, once its buffer fills up, the client thread gets blocked printing to STDOUT, and in turn the entire router thread on the Zellij server gets blocked when trying to send messages to that client.
The fix
Instead of sending messages directly to the client from the server, we now employ a separate thread per client to send it messages. This thread has its own buffer (50 messages); once that buffer is full, we stop sending messages to this particular client (essentially "forget" about it). If the client ever picks up its end of the IPC socket again, we send it one last "Buffer full" message before telling it to disconnect, so it knows what happened.
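The following is a minimal sketch of that mechanism, not the actual zellij code: a bounded channel of 50 messages feeds a per-client sender thread that owns the blocking IPC write, and a non-blocking `try_send` on the router side decides when to "forget" the client. The names (`ClientMsg`, `spawn_client_sender`, `send_or_forget`) and the detached thread used to deliver the final "Buffer full" notice are illustrative assumptions, not the real implementation.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;

#[derive(Debug)]
enum ClientMsg {
    Render(String),
    BufferFull, // the one last message sent before the client is told to disconnect
}

const CLIENT_BUFFER_SIZE: usize = 50;

// Spawned once per client; the spawned thread owns the (blocking) IPC write.
fn spawn_client_sender<F>(send_to_ipc: F) -> SyncSender<ClientMsg>
where
    F: Fn(ClientMsg) + Send + 'static,
{
    let (tx, rx) = sync_channel::<ClientMsg>(CLIENT_BUFFER_SIZE);
    thread::spawn(move || {
        // This loop may block on the IPC pipe, but only this client's
        // thread blocks - the router keeps serving other clients.
        for msg in rx {
            send_to_ipc(msg);
        }
    });
    tx
}

// Called from the router: never blocks. Returns false when the client should
// be forgotten (its buffer is full or it disconnected).
fn send_or_forget(tx: &SyncSender<ClientMsg>, msg: ClientMsg) -> bool {
    match tx.try_send(msg) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            // The client stopped reading. Queue one final notice from a
            // detached thread: the blocking `send` only completes if the
            // client ever drains its end of the pipe again.
            let tx = tx.clone();
            thread::spawn(move || {
                let _ = tx.send(ClientMsg::BufferFull);
            });
            false
        }
        Err(TrySendError::Disconnected(_)) => false,
    }
}

fn main() {
    // Stand-in for the blocking IPC write; a real client would read these
    // from its end of the socket.
    let tx = spawn_client_sender(|msg| println!("ipc <- {:?}", msg));
    let delivered = send_or_forget(&tx, ClientMsg::Render("hello".into()));
    println!("delivered: {}", delivered);
    // give the sender thread a moment to drain before the process exits
    thread::sleep(std::time::Duration::from_millis(50));
}
```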
Other considerations
I initially wanted to fix this by making sending messages to the client non-blocking with a buffer (to avoid the indirection of doing this in a separate thread), but the only way to do that would be to make the entire IPC channel non-blocking. That would require quite a bit of infrastructure change on our part (probably writing an event loop or using a shiny async framework to manage it).