Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

socket_manager: add feature to share sockets with another server #150

Conversation

daipom
Copy link
Contributor

@daipom daipom commented Oct 15, 2024

Another process can take over UDP/TCP sockets without downtime.

server = ServerEngine::SocketManager::Server.share_sockets_with_another_server(path)

This starts a new server that shares all UDP/TCP sockets with the existing server.
The old process should stop without removing the file for the socket after the new process starts.

This allows us to replace both the server and the workers with new processes without socket downtime.
(The existing live restart feature does not support network servers.
We can restart workers without socket downtime, but there is no such way for the network server.)

ref: fluent/fluentd#4622

Limitation

  • This feature would not work well if the process opens new TCP ports frequently.
    • If a client listens to a new socket during the sharing process, the new socket may not take over.
    • If this happens ...
      • Safe case: After the old server stops, the new server re-opens it.
      • Unsafe case: The new server tries to re-open it before the old server stops and causes address already in use error.
  • The old process must not remove the socket file when it stops.

TODO

  • Add tests
  • Consider limitations
  • Consider exclusive lock

@Watson1978 Watson1978 force-pushed the add-feature-to-take-over-another-server branch 3 times, most recently from 6546cf2 to 03bc031 Compare October 15, 2024 09:19
@ashie ashie marked this pull request as ready for review October 16, 2024 03:37
@daipom daipom marked this pull request as draft October 16, 2024 07:53
@daipom
Copy link
Contributor Author

daipom commented Oct 16, 2024

It would be necessary to consider exclusive locks.

@daipom daipom force-pushed the add-feature-to-take-over-another-server branch 2 times, most recently from 772d7df to 13b3d39 Compare October 21, 2024 03:22
@daipom
Copy link
Contributor Author

daipom commented Oct 21, 2024

https://github.com/treasure-data/serverengine/compare/772d7dfa672038fb0745eee7eb608799ef81b9a1..13b3d398a68d5924e9d5365a302b16c2804a1124

Somehow, the fork process couldn't receive SIGTERM, so I gave up using fork for the tests.

@daipom daipom marked this pull request as ready for review October 21, 2024 03:26
@daipom daipom force-pushed the add-feature-to-take-over-another-server branch from 13b3d39 to 00eba2a Compare October 21, 2024 05:44
@daipom daipom changed the title socket_manager: add feature to take over another server socket_manager: add feature to share sockets with another server Oct 21, 2024
@kenhys
Copy link
Collaborator

kenhys commented Oct 21, 2024

It may be better to note explicitly the scope of this PR (out of scope live-restart for supervisor, we focus on server <=> worker)
https://github.com/treasure-data/serverengine?tab=readme-ov-file#live-restart

@daipom daipom force-pushed the add-feature-to-take-over-another-server branch 2 times, most recently from 95ccac0 to a514542 Compare October 21, 2024 08:23
@daipom
Copy link
Contributor Author

daipom commented Oct 21, 2024

Thanks for your review!
This PR allows us to restart network servers without socket downtime.
I fixed the description of this PR and the commit message.

Copy link
Collaborator

@kenhys kenhys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Contributor

@Watson1978 Watson1978 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@daipom
Copy link
Contributor Author

daipom commented Oct 22, 2024

I'll rebase this.

Another process can take over UDP/TCP sockets without downtime.

    server = ServerEngine::SocketManager::Server.share_sockets_with_another_server(path)

This starts a new server that shares all UDP/TCP sockets with
the existing server.
The old process should stop without removing the file for the
socket after the new process starts.

This allows us to replace both the server and the
workers with new processes without socket downtime.
(The existing live restart feature does not support network
servers. We can restart workers without downtime, but there is
no way to restart the network server without downtime.)

ref: fluent/fluentd#4622

Limitation: This feature would not work well if the process
opens new TCP ports frequently.

Signed-off-by: Daijiro Fukuda <[email protected]>
Co-authored-by: Shizuo Fujita <[email protected]>
@daipom daipom force-pushed the add-feature-to-take-over-another-server branch from 1a839c7 to 9a42d6b Compare October 22, 2024 01:29
@kenhys kenhys merged commit d0a75c6 into treasure-data:master Oct 22, 2024
10 checks passed
@kenhys kenhys deleted the add-feature-to-take-over-another-server branch October 22, 2024 02:15
@daipom
Copy link
Contributor Author

daipom commented Oct 22, 2024

Thanks for your review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants