Implement graceful shutdown #2812

hashmap · 2019-05-07T17:56:44Z

Currently the grin server sets a stop flag, sends a stop message to each peer thread, waits for 1 second and exits. Because there is no graceful shutdown, we introduced a stop mutex to protect critical phases such as chain updates. This global lock carries a performance penalty, and we have to find every such place manually and protect it.

This PR introduces a graceful shutdown using thread::join: we track all important threads in grin and make sure that they exit. This way we can be sure that threads stop only in safe places (at a flag or message check).
The mutex around the read/sent stats for a peer was also removed, which should improve performance.
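
To illustrate the flag-plus-join approach described above, here is a minimal sketch using only the standard library. It is not the code from this PR: `spawn_worker`, `shutdown` and the counter are hypothetical names, and the real server threads check the flag (or a stop message) at their own safe points.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a long-running server thread (sync, peer loop, ...).
fn spawn_worker(stop: Arc<AtomicBool>, units_done: Arc<AtomicUsize>) -> JoinHandle<()> {
    thread::spawn(move || {
        // The flag is checked only between units of work, so the thread
        // never exits in the middle of a critical phase.
        while !stop.load(Ordering::SeqCst) {
            // ... one unit of work would go here ...
            units_done.fetch_add(1, Ordering::SeqCst);
            thread::sleep(Duration::from_millis(5));
        }
    })
}

/// Sets the stop flag, then joins every tracked thread.
/// Returns the number of threads that exited without panicking.
fn shutdown(stop: &AtomicBool, handles: Vec<JoinHandle<()>>) -> usize {
    stop.store(true, Ordering::SeqCst);
    handles
        .into_iter()
        .map(|handle| handle.join())
        .filter(|result| result.is_ok())
        .count()
}
```

Because `shutdown` returns only after every `join` has completed, no worker can still be in the middle of a unit of work once it returns, which is the guarantee the stop mutex was providing before.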

This PR doesn't cover Stratum stop (we don't have it).
A TUI update is also needed: we need to inform the user that we are shutting down, because shutdown may take a few seconds.

hashmap · 2019-05-11T07:04:55Z

Ready for review. It passes tests; there is a CI issue, which we discussed on Gitter.

ignopeverell · 2019-05-13T21:52:52Z

There are quite a few changes in the lower-level peer and connection locking in here. Have you tried running it for a while?

hashmap · 2019-05-13T22:16:54Z

@ignopeverell I've been running it for 4 days with no issues so far. I also tested sync from scratch a few times, stopping the node at random points. When a node is running with ~150 peers, stopping takes some time: 5-20 seconds.

DavidBurkett · 2019-05-13T23:02:55Z

Is this still with non-blocking IO? If it's non-blocking, it should be possible to implement shutdown with only a few milliseconds of delay. Where is the delay coming from?

hashmap · 2019-05-14T09:06:42Z

@DavidBurkett yes, with non-blocking IO. It depends on the state of the threads. E.g. sync was sleeping for 10 secs; I made some optimizations to keep it under 1 sec. Also, if a peer is processing a txhashset we would wait a couple of minutes.
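
One common way to keep a long sleep from delaying shutdown is to sleep in short slices and check the stop flag between them. The sketch below shows that idea only; it is an assumption about the general technique, not the actual sync change in this PR, and `sleep_unless_stopped` is a hypothetical helper.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Sleeps for up to `total`, waking every `slice` to check the stop flag.
/// Returns true if the stop flag was observed, false if the full time elapsed.
fn sleep_unless_stopped(stop: &AtomicBool, total: Duration, slice: Duration) -> bool {
    let deadline = Instant::now() + total;
    loop {
        if stop.load(Ordering::SeqCst) {
            return true;
        }
        let now = Instant::now();
        if now >= deadline {
            return false;
        }
        // Never sleep past the deadline, and never longer than one slice.
        thread::sleep(slice.min(deadline - now));
    }
}
```

With a 10 second `total` and a slice well under 1 second, a thread waiting in this helper reacts to the stop flag within roughly one slice instead of the whole interval.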

If a node is running, the majority of the time comes from waiting for the Peers lock to be released. We have a Peers object, which acts as a manager of peers. It's shared and protected by an RwLock. When we shut down the server we take an exclusive write lock, because we need to be sure that we stop all peers and that no new peers will be added. However, some peers may be doing something at that moment (processing a header, pinging peers, serving another peer's request), which usually requires some level of access to the Peers object (usually read, sometimes write). That leads to a deadlock, because the main thread holds the lock while waiting for those threads to finish. To prevent this I used try_lock on the Peers object with a 2 second timeout. During normal operation we hold the Peers lock for a very short period of time, so if we can't get it within 2 seconds it means that we are shutting down the server, and it's safe to fail on taking a new lock. Perhaps we could decrease the timeout.
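
A rough sketch of taking a lock with a timeout, as described above, using only the standard library. std's `RwLock` has no timed acquire, so this version polls `try_read` until a deadline; `Peers`, `addrs` and `try_read_for` here are hypothetical illustrations rather than the types or helpers in this PR. (The `parking_lot` crate offers timed methods such as `try_read_for` on its own `RwLock`, which avoids the polling loop.)

```rust
use std::sync::{RwLock, RwLockReadGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the shared peer manager.
struct Peers {
    addrs: Vec<String>,
}

/// Tries to take a read lock, giving up after `timeout`.
/// Returns None on timeout (or if the lock is poisoned), so the caller
/// can bail out instead of blocking a thread that shutdown is waiting on.
fn try_read_for<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockReadGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_read() {
            Ok(guard) => return Some(guard),
            Err(TryLockError::Poisoned(_)) => return None,
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return None;
                }
                // Back off briefly before polling again.
                thread::sleep(Duration::from_millis(5));
            }
        }
    }
}
```

A peer thread that gets `None` back can treat it as a signal that shutdown is in progress and return to its stop check, rather than waiting on a lock the main thread will not release until that thread has exited.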

DavidBurkett · 2019-05-14T11:29:27Z

Thanks for the thorough explanation @hashmap. The syncing delays make sense, but the deadlock when there's a high number of peers is what I would like to be able to avoid. Once you get this merged in, I'm going to take a stab at trying to break the deadlock. In Grin++, I had a similar issue, but was able to deal with it using an additional mutex: https://github.com/GrinPlusPlus/GrinPlusPlus/blob/master/P2P/ConnectionManager.cpp#L247. We may be able to do something similar.

hashmap · 2019-05-14T12:02:34Z

@DavidBurkett makes sense. I've used a similar strategy in this PR, so perhaps we can remove this delay too. To be on the same page: we don't have a deadlock now, but a peer's thread may get stuck for 2 seconds during shutdown.

ignopeverell · 2019-05-14T21:02:48Z

@hashmap can you fix the conflicts? I'd like to have this in 1.1.0 beta2 for testing.

hashmap · 2019-05-14T22:26:25Z

@ignopeverell sure, in 5-7 hours, on a plane now

antiochp · 2019-05-15T21:07:17Z

Just been testing this locally with a handful of restarts. 👍
Both q and ctrl-c.

hashmap added 6 commits May 6, 2019 19:42

first pass

106f91b

checkpoint

63727c1

Remove stop status mutex

8720483

remove some deadlocks

7aece30

Rewrite stop channel handling

eed759d

fix deadlock in peers object

4ea756c

hashmap marked this pull request as ready for review May 10, 2019 17:52

add missing test fixes

235ebda

hashmap requested review from antiochp and ignopeverell May 11, 2019 07:05

ignopeverell added the enhancement label May 13, 2019

ignopeverell added this to the 1.1.1 milestone May 13, 2019

DavidBurkett approved these changes May 14, 2019

ignopeverell modified the milestones: 1.1.1, 1.1.0 May 14, 2019

merge master

7a20ea8

ignopeverell merged commit 9ab23f6 into mimblewimble:master May 15, 2019

This was referenced May 17, 2019

clean-up work when process quit #2364

Closed

REST API Server Need Support Stop #1934

Closed

antiochp added the release notes To be included in release notes (of relevant milestone). label Jun 5, 2019

hashmap mentioned this pull request Jul 3, 2019

Interrupting grin during chain validation results in corrupted data #2770

Closed
