Implement graceful shutdown #2812
Ready for review. It passes tests; there is a CI issue, which we discussed on Gitter.
There are quite a few changes in the lower-level peer and connection locking in here. Have you tried running it for a while?
@ignopeverell I've been running it for 4 days with no issues so far. I also tested syncing from scratch a few times, stopping the node at different random points. When a node is running with ~150 peers, stopping takes some time: 5-20 seconds.
Is this still with non-blocking IO? If it's non-blocking, it should be possible to implement it with only a few milliseconds of delay on shutdown. Where is the delay coming from?
@DavidBurkett yes, with non-blocking IO; it depends on the state of the threads. E.g. sync was sleeping for 10 secs, so I made some optimizations to keep it under 1 sec. Also, if a peer is processing a txhashset we would wait a couple of minutes. If a node is running, the majority of the time comes from waiting for a peers lock to be released. We have
Thanks for the thorough explanation @hashmap. The syncing delays make sense, but the deadlock when there's a high number of peers is what I would like to be able to avoid. Once you get this merged in, I'm going to take a stab at trying to break the deadlock. In Grin++, I had a similar issue, but was able to deal with it using an additional mutex: https://github.com/GrinPlusPlus/GrinPlusPlus/blob/master/P2P/ConnectionManager.cpp#L247. We may be able to do something similar.
@DavidBurkett makes sense. I've used a similar strategy in this PR; perhaps we can remove this delay too. To be on the same page: we don't have a deadlock now, but a peer's thread may get stuck for 2 seconds during shutdown.
@hashmap can you fix the conflicts? I'd like to have this in 1.1.0 beta2 for testing.
@ignopeverell sure, in 5-7 hours, on a plane now
Just been testing this locally with a handful of restarts. 👍
Currently the grin server sets a stop flag and sends a stop message to each peer thread, waits for 1 second, and exits. Because there was no graceful shutdown, we introduced a stop mutex to protect critical phases such as chain updates. This global lock brings some performance penalties, and we also need to manually find all such places and protect them.
This PR introduces a graceful shutdown using `thread::join`: we track all important threads in grin and make sure that they exit. This way we are sure that threads stop in safe places (a flag or message check). The mutex around the read/sent stats for a peer was also removed, which should improve performance.
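The flag-plus-join pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Workers`, `spawn`, `shutdown`, and the `iterations` counter are hypothetical names, and the 10 ms sleep stands in for a unit of real work.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a node's tracked threads (not grin's real types).
pub struct Workers {
    stop: Arc<AtomicBool>,
    handles: Vec<JoinHandle<()>>,
}

impl Workers {
    /// Spawn `n` workers. Each one bumps `iterations` after every completed
    /// unit of work.
    pub fn spawn(n: usize, iterations: Arc<AtomicUsize>) -> Workers {
        let stop = Arc::new(AtomicBool::new(false));
        let handles = (0..n)
            .map(|_| {
                let stop = Arc::clone(&stop);
                let iterations = Arc::clone(&iterations);
                thread::spawn(move || {
                    // The flag is only checked between units of work, so a
                    // thread never exits in the middle of one.
                    while !stop.load(Ordering::Relaxed) {
                        // Short sleep slices keep the stop latency low.
                        thread::sleep(Duration::from_millis(10));
                        iterations.fetch_add(1, Ordering::Relaxed);
                    }
                })
            })
            .collect();
        Workers { stop, handles }
    }

    /// Set the stop flag, then join every tracked thread.
    /// Returns how many threads exited without panicking.
    pub fn shutdown(self) -> usize {
        self.stop.store(true, Ordering::Relaxed);
        self.handles
            .into_iter()
            .map(|h| h.join())
            .filter(|r| r.is_ok())
            .count()
    }
}
```

Calling `Workers::spawn(4, counter)` and later `shutdown()` blocks until all four threads have returned, so nothing is still running when the process exits. The shutdown time is bounded by the longest interval between two flag checks, which is why long sleeps or long-running tasks in a thread show up directly as shutdown delay.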
This PR doesn't cover Stratum stop (we don't have it).
A TUI update is also needed: we need to inform the user that we are shutting down, because shutdown may take a few seconds.