Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement graceful shutdown #2812

Merged
merged 8 commits into from
May 15, 2019

Conversation

hashmap
Copy link
Contributor

@hashmap hashmap commented May 7, 2019

Currently grin server sets a stop flag and sends a stop message to each peer thread, waits for 1 second and exits. As result of the lack of graceful shutdown we introduced a stop mutex to protect critical phases like chain update etc. This global lock brings some performance penalties, also we need to manually find all such places and protect them.

This PR introduced a graceful shutdown using thread::join, so we track all important threads in grin and make sure that they exit. In this case we are sure that threads stop in safe places (flag or message check).
Also mutex around read/sent stats for peer was removed, should improve perfomance.

This PR doesn't cover Stratum stop (we don't have it).
Also TUI update is needed, we need to inform user that we are shutting down, because shutdown may take a few seconds.

@hashmap hashmap marked this pull request as ready for review May 10, 2019 17:52
@hashmap
Copy link
Contributor Author

hashmap commented May 11, 2019

Ready for review. It passes tests, there is a CI issue, we discussed it in gitter

@hashmap hashmap requested review from antiochp and ignopeverell May 11, 2019 07:05
@ignopeverell
Copy link
Contributor

There are quite a few changes in the lower-level peer and connection locking in here. Have you tried running it for a while?

@ignopeverell ignopeverell added this to the 1.1.1 milestone May 13, 2019
@hashmap
Copy link
Contributor Author

hashmap commented May 13, 2019

@ignopeverell I've been running it for 4 days, no issues so far. Also a few times tested sync from scratch stopping node at different random points. When a node is running with ~ 150 peers stop takes some time, 5-20 seconds.

@DavidBurkett
Copy link
Contributor

Is this still with non blocking io? If it's non-blocking, it should be able to be implemented with only a few millisecond delay on shutdown. Where is the delay coming from?

@hashmap
Copy link
Contributor Author

hashmap commented May 14, 2019

@DavidBurkett yes, with non-blocking IO, it depends on state of threads. Eg sync was sleeping for 10 secs, I made some optimization to keep it under 1 sec. Also if a peer is processing a txhashset we would wait couple minutes.

If a node is running the majority of time comes from waiting for a peers lock to be released. We have Peers object, it's like a manager of peers. It's shared and protected by RwLock. When we shutdown the server we take an exclusive write lock because we need to be sure that we stop all peers and no new peers will be added. However some peers may be doing something at that moment (processing a header, pinging peers, serving other peer request), which usually requires some level (usually read, sometimes write) of access to Peers object, which leads to a deadlock, because in the main thread we keep that lock and wait for those threads to finish. To prevent it I used try_lock on Peers object with 2 seconds timeout. During normal operations we take a Peer's lock for very short period of time, if we can't get it in 2 seconds it means that we are shutting down the server so it's safe to fail on taking a new lock. Perhaps we could decrease timeout.

@DavidBurkett
Copy link
Contributor

Thanks for the thorough explanation @hashmap. The syncing delays make sense, but the deadlock when there's a high number of peers is what I would like to be able to avoid. Once you get this merged in, I'm going to take a stab at trying to break the deadlock. In Grin++, I had a similar issue, but was able to deal with it using an additional mutex: https://github.com/GrinPlusPlus/GrinPlusPlus/blob/master/P2P/ConnectionManager.cpp#L247. We may be able to do something similar.

@hashmap
Copy link
Contributor Author

hashmap commented May 14, 2019

@DavidBurkett makes sense, I've used a similar strategy in this PR, perhaps we can remove this delay too. To be on the same page we don't have a deadlock now, but a peer's thread may get stuck for 2 seconds during shutdown.

@ignopeverell
Copy link
Contributor

@hashmap can you fix the conflicts? I'd like to have this in 1.1.0 beta2 for testing.

@ignopeverell ignopeverell modified the milestones: 1.1.1, 1.1.0 May 14, 2019
@hashmap
Copy link
Contributor Author

hashmap commented May 14, 2019

@ignopeverell sure, in 5-7 hours, on a plane now

@ignopeverell ignopeverell merged commit 9ab23f6 into mimblewimble:master May 15, 2019
@antiochp
Copy link
Member

antiochp commented May 15, 2019

Just been testing this locally with a handful of restarts. 👍
Both q and ctrl-c.

@antiochp antiochp added the release notes To be included in release notes (of relevant milestone). label Jun 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement release notes To be included in release notes (of relevant milestone).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants