Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race conditions on startup of an ssh connection with the client role #7550

Closed
alexandrejbr opened this issue Aug 11, 2023 · 21 comments · Fixed by #8766
Closed

Race conditions on startup of an ssh connection with the client role #7550

alexandrejbr opened this issue Aug 11, 2023 · 21 comments · Fixed by #8766
Assignees
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS

Comments

@alexandrejbr
Copy link

Describe the bug
There were 2 race conditions that can happen on the startup of a SSH client connection:

  1. two process race to create the ssh_system_sup for a given address, which can cause the start of the ssh_system_sup to return {error, {already_started, Pid}};
  2. one process race to create the ssh_system_sup when it's being brought down due to all significant processes have exited, which makes the start of the ssh_subsystem_sup to crash on supervisor:start_child with noproc since the ssh_system_sup has terminated in the mean time.

Additionally in the first scenario the start_system function is trying to create an ssh_acceptor to any type of system, but an ssh_acceptor doesn't make sense for the client role so that's a bug as well I believe - in our tests the ssh_acceptor fails to start anyway and creates error reports.

To Reproduce
This is hard to reproduce but the best way is to create new connections to the same address in parallel for the first scenario. In order to reproduce the second scenario one can interleave come connection close and hopefully the issue shows itself.

Affected versions
OTP-24, OTP-25, OTP-26

Additional context
A client starting an SSH connection was simpler in previous versions, I wonder what was gained with the added complexity of having a system and a subsystem supervisor for the connections. Of course, the supervision tree looks more similar for client and server roles now, but other than that I struggle to see the benefit. Hopefully these were the only issues.

@alexandrejbr alexandrejbr added the bug Issue is reported as a bug label Aug 11, 2023
@alexandrejbr
Copy link
Author

I have create a PR with a fix for these issues: #7549

@u3s u3s added the team:PS Assigned to OTP team PS label Aug 11, 2023
@IngelaAndin
Copy link
Contributor

I think the added complexity must have been an oversight when a refactor was made for some other reason. It makes no sense at all to have an acceptor for the client. We will look into this and review your PR.

@alexandrejbr
Copy link
Author

Thanks for the feedback @IngelaAndin. If that's the case perhaps a better solution is to ignore my PR and go back to have the ssh_connection_handler under the sshc_sup like it was for instance in OTP 22.

In any case, we have been testing for a couple of weeks now the changes of my PR and we no longer see errors due to the race conditions, so maybe like this is ok as well.

@u3s u3s self-assigned this Aug 22, 2023
@u3s u3s linked a pull request Aug 22, 2023 that will close this issue
@u3s
Copy link
Contributor

u3s commented Nov 8, 2023

Additional context A client starting an SSH connection was simpler in previous versions, I wonder what was gained with the added complexity of having a system and a subsystem supervisor for the connections. Of course, the supervision tree looks more similar for client and server roles now, but other than that I struggle to see the benefit. Hopefully these were the only issues.

Some context explaining update of ssh client supervision tree can be found in OTP-23.0 readme

Due to above I don't think restoring OTP-22 code is the way to go.

However revision of current supervision tree is needed.

  • You are right that starting acceptor on client side should not happen
  • The id used for registering ssh_system_sup per client side connection requires verification. maybe using #address record (local address, local port and profile) is not a good choice (this is why you get already started errors)
  • I'm not yet sure if system and subsystem supervisors are needed

@IngelaAndin IngelaAndin added the stalled waiting for input by the Erlang/OTP team label Dec 12, 2023
@u3s
Copy link
Contributor

u3s commented Feb 21, 2024

It was decided to postpone work related to ssh supervision tree.
I would like to revise how supervision tree is implemented and then decide how it should be fixed.

@alexandrejbr
Copy link
Author

Let's close this issue. It's in my TODO list to remove the patches and try to gather more information about the errors we observe without the patches.

If I manage to gather more information about the issue I will open a new issue or re-open this one.

@u3s
Copy link
Contributor

u3s commented Feb 22, 2024

(copy from other PR, putting here for reference as it is related here as well)
OK. In general I agree there are issues with supervision tree appearing in some border cases.
But I'm not sure if the solutions proposed at this stage are the optimal ones ... and need more time for checking that.

@u3s
Copy link
Contributor

u3s commented Jul 17, 2024

Is OTP-22 not affected by this issue?
I'm revising ssh supervision tree evolution to figure out how it could be improved. Your feedback would help.

Supervision complexity in ssh:connect was first brought in OTP-23 and later evolved.
I would expect OTP-23 to have similar issues as OTP-24,25,26 (listed as affected versions in description). Do you know if OTP-23 is also affected?

@alexandrejbr
Copy link
Author

OTP-22 is not affected because the ssh_connection_handler is immediately under the top level supervisor for the client. Probably the sshc_sup if my memory doesn't fail me.

I have also looked recently to this issue to explore the conflicting ports theory. I can see that multiple processes are opening connections towards the same host. If internal ports can be reused (as it seems) then for the client role the current id of host+internal port is not enough.

@u3s
Copy link
Contributor

u3s commented Aug 30, 2024

I agree with your suggestion.

In OTP-22 simple_one_for_one is used and ssh_connection_handler are identified with just pids. This has advantages.

After updates ssh_connection_handler process are basically identified with {local IP, local Port} which might not be enough.
Using {local IP, local Port, remote IP, remote Port} might work better.

I'm not yet sure what path we should choose but I agree something has to be done with id used for ssh_connection_handler processes for improving situation.

Due to above I'm re-opening this issue and propose to close it when we have some solution improving the discussed id.

@u3s u3s reopened this Aug 30, 2024
@alexandrejbr
Copy link
Author

I suspect that unless we introduce a random identifier we can’t really have a good key for this. I may be able to try the key you suggest sometime in October, but for sure in late November.

@u3s
Copy link
Contributor

u3s commented Aug 30, 2024

Yes. using random identifiers seems a simplest and safest choice at this stage.

I've a prototype ssh code where system_sup is removed from the client side supervision tree. Subsystems sups are identified by refs.

i wonder if there is a possibility of testing it in your environment (if it is provided as a PR maybe ...)?

@u3s u3s removed stalled waiting for input by the Erlang/OTP team priority:low labels Aug 30, 2024
@alexandrejbr
Copy link
Author

Yes, something that is likely to work we can try it now. Something that is likely not to work I would rather just try later because it brings a lot of noise in our test environment and we are close to a release.

If you provide us with a PR or a branch I can make sure we start testing with it.

@u3s
Copy link
Contributor

u3s commented Sep 2, 2024

  • which OTP release is a best fit for you ? 27, 26 or 25 ?
  • the concept of removing system sup for ssh client has been discussed and accepted internally
  • as it comes to quality of PR code to be verified - it would be tested with our standard lab tests which are run before every release or patch for ssh
  • it is not urgent - we can wait some time more if it is more convenient for you

@alexandrejbr
Copy link
Author

Sounds good. I am confident that we understand the problem now and that the solution you propose will remove the problems with conflicting ports, therefore I would be ok start testing as soon as you have something. We have more active development in OTP 27, so we detect issues faster in OTP 27 so I suggest we try the fix for that version.

@u3s
Copy link
Contributor

u3s commented Sep 2, 2024

PR-#8766 is dedicated for OTP-27 release . To apply prototype you need to cherry-pick 2 commits:

  1. Firstly take 6c02d43 (this is just a code refactor removing unused argument but in related area)
  2. Secondly take 4727d51 (this removes system supervisor in client supervision tree)

@u3s
Copy link
Contributor

u3s commented Sep 10, 2024

@alexandrejbr any news? :-)

@alexandrejbr
Copy link
Author

@u3s Not yet. I’ll get back to you at the end of next week.

@u3s u3s linked a pull request Sep 10, 2024 that will close this issue
@alexandrejbr
Copy link
Author

@u3s so far things seem stable, but we only have been running it for a few days.

@u3s
Copy link
Contributor

u3s commented Sep 20, 2024

Great. Thanks for feedback provided so far. Let us know if issue re-appears later.
We are about to merge PR-8766 as a bug fix for supported OTP versions (25, 26, 27). Apart from (hopefully) fixing this issue it also simplifies supervision tree structure and related code.

u3s added a commit that referenced this issue Sep 23, 2024
OTP-19124

* kuba/ssh/no_system_sup_for_client/GH-7550:
  ssh: rename ssh_subsystem_sup to ssh_connection_sup
  ssh: do_start_subsystem added, skip system_sup for client
  ssh: remove unused Address from function arguments
  ssh: ssh_connection_SUITE prints interesting events
  ssh: ssh_connection_SUITE interrupted_send explained in code comment
@u3s
Copy link
Contributor

u3s commented Sep 23, 2024

Related PR is merged and planned to be released. Closing the issue for now as I believe it is fixed.

@u3s u3s closed this as completed Sep 23, 2024
IngelaAndin pushed a commit that referenced this issue Oct 9, 2024
…19124' into maint-26

* kuba/maint-26/ssh/no_system_sup_for_client/GH-7550/OTP-19124:
  ssh: rename ssh_subsystem_sup to ssh_connection_sup
  ssh: do_start_subsystem added, skip system_sup for client
  ssh: remove unused Address from function arguments
Mikaka27 pushed a commit that referenced this issue Oct 14, 2024
…19124' into maint-25

* kuba/maint-25/ssh/no_system_sup_for_client/GH-7550/OTP-19124:
  ssh: rename ssh_subsystem_sup to ssh_connection_sup
  ssh: do_start_subsystem added, skip system_sup for client
  ssh: remove unused Address from function arguments
jhogberg pushed a commit that referenced this issue Oct 17, 2024
…to maint-27

* kuba/ssh/no_system_sup_for_client/GH-7550/OTP-19124:
  ssh: rename ssh_subsystem_sup to ssh_connection_sup
  ssh: do_start_subsystem added, skip system_sup for client
  ssh: remove unused Address from function arguments
  ssh: ssh_connection_SUITE prints interesting events
  ssh: ssh_connection_SUITE interrupted_send explained in code comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue is reported as a bug team:PS Assigned to OTP team PS
Projects
None yet
3 participants