Race conditions on startup of an ssh connection with the client role #7550
Comments
I have created a PR with a fix for these issues: #7549 |
I think the added complexity must have been an oversight when a refactor was made for some other reason. It makes no sense at all to have an acceptor for the client. We will look into this and review your PR. |
Thanks for the feedback @IngelaAndin. If that's the case, perhaps a better solution is to ignore my PR and go back to having the ssh_connection_handler under the sshc_sup, as it was in OTP 22, for instance. In any case, we have been testing the changes of my PR for a couple of weeks now and we no longer see errors due to the race conditions, so maybe it is ok as it is. |
Some context explaining the update of the ssh client supervision tree can be found in the OTP-23.0 readme. Because of that, I don't think restoring the OTP-22 code is the way to go. However, a revision of the current supervision tree is needed.
|
It was decided to postpone work related to ssh supervision tree. |
Let's close this issue. It's in my TODO list to remove the patches and try to gather more information about the errors we observe without the patches. If I manage to gather more information about the issue I will open a new issue or re-open this one. |
(copy from other PR, putting here for reference as it is related here as well) |
Is OTP-22 not affected by this issue? Supervision complexity in |
OTP-22 is not affected because the ssh_connection_handler is immediately under the top-level supervisor for the client, probably sshc_sup if my memory doesn't fail me. I have also recently looked into this issue to explore the conflicting ports theory. I can see that multiple processes are opening connections towards the same host. If internal ports can be reused (as it seems), then for the client role the current id of host + internal port is not enough. |
I agree with your suggestion. In OTP-22 simple_one_for_one is used and ssh_connection_handler processes are identified just by their pids. This has advantages. After the updates, ssh_connection_handler processes are basically identified by {local IP, local Port}, which might not be enough. I'm not yet sure what path we should choose, but I agree something has to be done with the id used for ssh_connection_handler processes to improve the situation. Because of that, I'm re-opening this issue and propose to close it when we have a solution that improves the discussed id. |
I suspect that unless we introduce a random identifier we can’t really have a good key for this. I may be able to try the key you suggest sometime in October, but for sure in late November. |
Yes, using random identifiers seems the simplest and safest choice at this stage. I have prototype ssh code where system_sup is removed from the client-side supervision tree. Subsystem supervisors are identified by refs. I wonder if there is a possibility of testing it in your environment (if it is provided as a PR, maybe ...)? |
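For reference, the id problem discussed above can be illustrated with a small toy model. This is a sketch in Python, not the OTP ssh code: `ChildRegistry`, `address_key`, and `unique_key` are hypothetical names, and the registry only imitates the general supervisor rule that a child id already in use is refused. It assumes, as suspected in this thread, that a local port can be reused while an entry for the earlier connection is still registered.

```python
import uuid


class ChildRegistry:
    """Toy stand-in for a supervisor that refuses duplicate child ids.

    Hypothetical model for illustration only; it is not the ssh
    application's supervisor code.
    """

    def __init__(self):
        self._children = {}

    def start_child(self, child_id, handler):
        # A child id that is already registered is rejected rather
        # than started a second time.
        if child_id in self._children:
            return ("error", "already_started")
        self._children[child_id] = handler
        return ("ok", handler)


def address_key(local_ip, local_port):
    # Id derived from the local address: identical for two connections
    # that end up with the same local IP and port.
    return (local_ip, local_port)


def unique_key():
    # Random id, playing the role that a ref plays in the proposal above.
    return uuid.uuid4().hex


by_address = ChildRegistry()
first = by_address.start_child(address_key("10.0.0.5", 50000), "handler_a")
# Assumed scenario: the port is reused while the old entry is still present.
second = by_address.start_child(address_key("10.0.0.5", 50000), "handler_b")

by_unique_id = ChildRegistry()
third = by_unique_id.start_child(unique_key(), "handler_a")
fourth = by_unique_id.start_child(unique_key(), "handler_b")

print(first, second)
print(third, fourth)
```

With the address-derived id the second start is refused, while with random ids both starts succeed, which is the property the ref-based proposal relies on.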
Yes, something that is likely to work we can try now. Something that is likely not to work I would rather try later, because it brings a lot of noise into our test environment and we are close to a release. If you provide us with a PR or a branch I can make sure we start testing with it. |
|
Sounds good. I am confident that we understand the problem now and that the solution you propose will remove the problems with conflicting ports, so I would be ok to start testing as soon as you have something. We have more active development in OTP 27, so we detect issues faster there; I suggest we try the fix for that version. |
@alexandrejbr any news? :-) |
@u3s Not yet. I’ll get back to you at the end of next week. |
@u3s so far things seem stable, but we have only been running it for a few days. |
Great. Thanks for the feedback provided so far. Let us know if the issue reappears later. |
OTP-19124
* kuba/ssh/no_system_sup_for_client/GH-7550:
  ssh: rename ssh_subsystem_sup to ssh_connection_sup
  ssh: do_start_subsystem added, skip system_sup for client
  ssh: remove unused Address from function arguments
  ssh: ssh_connection_SUITE prints interesting events
  ssh: ssh_connection_SUITE interrupted_send explained in code comment
Related PR is merged and planned to be released. Closing the issue for now as I believe it is fixed. |
…19124' into maint-26
* kuba/maint-26/ssh/no_system_sup_for_client/GH-7550/OTP-19124:
  ssh: rename ssh_subsystem_sup to ssh_connection_sup
  ssh: do_start_subsystem added, skip system_sup for client
  ssh: remove unused Address from function arguments
…19124' into maint-25
* kuba/maint-25/ssh/no_system_sup_for_client/GH-7550/OTP-19124:
  ssh: rename ssh_subsystem_sup to ssh_connection_sup
  ssh: do_start_subsystem added, skip system_sup for client
  ssh: remove unused Address from function arguments
…to maint-27
* kuba/ssh/no_system_sup_for_client/GH-7550/OTP-19124:
  ssh: rename ssh_subsystem_sup to ssh_connection_sup
  ssh: do_start_subsystem added, skip system_sup for client
  ssh: remove unused Address from function arguments
  ssh: ssh_connection_SUITE prints interesting events
  ssh: ssh_connection_SUITE interrupted_send explained in code comment
Describe the bug
There are two race conditions that can happen on the startup of an SSH client connection:
Additionally, in the first scenario the start_system function tries to create an ssh_acceptor for any type of system, but an ssh_acceptor doesn't make sense for the client role, so I believe that is a bug as well. In our tests the ssh_acceptor fails to start anyway and creates error reports.
To Reproduce
This is hard to reproduce, but the best way to trigger the first scenario is to create new connections to the same address in parallel. To reproduce the second scenario, one can interleave some connection closes and hopefully the issue shows itself.
Affected versions
OTP-24, OTP-25, OTP-26
Additional context
Starting an SSH connection as a client was simpler in previous versions. I wonder what was gained with the added complexity of having a system and a subsystem supervisor for the connections. Of course, the supervision tree now looks more similar for the client and server roles, but other than that I struggle to see the benefit. Hopefully these were the only issues.