Harden shutdown logic #1037
Conversation
@@ -509,8 +514,10 @@ func (i *cmdInvocation) setupInterruptHandler() {
	case 0:
		fmt.Println(shutdownMessage)
		go func() {
			wg.Add(1)
this is racy -- the Add should go in the same goroutine where the wg gets initialized.
yeah, the add always needs to happen before a goroutine is spun off
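For reference, a minimal standalone sketch of the ordering being discussed (illustrative code, not taken from the PR):

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	// The Add must happen here, before the goroutine is spawned; if it were
	// inside the goroutine, Wait() below could return before Add ever runs.
	wg.Add(1)
	go func() {
		defer wg.Done()
		fmt.Println("shutdown work finished")
	}()
	wg.Wait()
}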
@torarnv this is looking good! on the wait group, maybe something like:

type IntrCloser struct {
	c  chan os.Signal
	wg sync.WaitGroup
}

func NewIntrCloser() *IntrCloser {
	ic := &IntrCloser{}
	ic.c = allInterruptSignals()
	ic.wg.Add(1)
	return ic
}

func (ic *IntrCloser) Done() {
	ic.wg.Done()
}

func (ic *IntrCloser) Close() error {
	signal.Stop(ic.c)
	ic.wg.Wait()
	return nil
}

func (i *cmdInvocation) setupInterruptHandler() *IntrCloser {
	intrc := NewIntrCloser()
	...
	// make sure to call intrc.Done()
	...
	return intrc
}
can actually just replace the waitgroup with a channel:

type IntrCloser struct {
	c    chan os.Signal
	done chan struct{}
}

func NewIntrCloser() *IntrCloser {
	ic := &IntrCloser{}
	ic.c = allInterruptSignals()
	ic.done = make(chan struct{})
	return ic
}

func (ic *IntrCloser) Done() {
	ic.done <- struct{}{}
}

func (ic *IntrCloser) Close() error {
	signal.Stop(ic.c)
	<-ic.done
	return nil
}

func (i *cmdInvocation) setupInterruptHandler() *IntrCloser {
	intrc := NewIntrCloser()
	...
	// make sure to call intrc.Done()
	...
	return intrc
}
@torarnv done with CR. gave suggestion on how to deal with the waitgroup problem, but no need to do it that way.
hey @torarnv! i'd love to merge this in soon. any chance you could handle the CR comments? else i can do it.
@jbenet Hey! Sorry, had some stuff come up. I'll look at this today!
@torarnv will want to rebase-- i think it diverged
(force-pushed from 6c64360 to 9914ca5)
#1081 got in there, thought github might detect the dependency?
@torarnv when we merge the other one, the merged commits will disappear from this listing
Need to double-check the tests, but general feedback welcome!
(force-pushed from a04cad4 to ef409c0)
btw @torarnv one thing to do is run the tests on every commit (every commit has to pass so that we can rebase easily in the future -- it makes it really annoying if we do not follow this practice.)
(will CR later tonight)
@jbenet all tests pass locally for each commit in the PR 🎆
select {
case <-ctx.Context.Done():
	log.Info("Gracefully shut down daemon")
default:
what is the purpose of the default here? Wouldn't we want to wait and print the log message?
That's based on the possibility that the node/daemon may shut down through other means, e.g. through an API call or similar that wouldn't cancel the context, only close the node. In that case we wouldn't want to block. But if that's not really an option, or we do want to go through the context for those shutdowns as well, it shouldn't have a default.
gotcha, thanks!
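To make the trade-off concrete, here is a small standalone sketch of the two variants being discussed (not the PR's code; the real version selects on the command context):

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(100 * time.Millisecond)
		cancel()
	}()

	// Non-blocking check: with a default case the select returns immediately,
	// so a shutdown path that never cancels the context cannot hang here.
	select {
	case <-ctx.Done():
		fmt.Println("Gracefully shut down daemon")
	default:
		fmt.Println("context not cancelled, not waiting")
	}

	// Blocking variant: without a default case we wait for the cancellation
	// and always print the message.
	<-ctx.Done()
	fmt.Println("Gracefully shut down daemon")
}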
If I could get a couple comments on the new struct and functions you added, that would be awesome. Otherwise this all looks good to me.
You mean comments from me? Or you want to comment on them? 😃
@@ -475,59 +469,62 @@ func writeHeapProfileToFile() error {
	return pprof.WriteHeapProfile(mprof)
}

// listen for and handle SIGTERM
func (i *cmdInvocation) setupInterruptHandler() {
type IntrHandler struct {
comment on this struct
Pointed out the two i'd like commented. Your commit messages are far nicer than mine, thank you for that :) . I mostly like having the comments on structs and functions to make it easy to grok the code through GoDoc.
Will be great to have this in.
(this will probably need rebasing on master, sorry. happy to rebase it for you if desired)
This reverts commit f74e71f. The 'Online' flag of the command context does not seem to be set in any code paths, at least not when running commands such as 'ipfs daemon' or 'ipfs ping'. The result after f74e71f is that we never shut down cleanly, as we'll always os.Exit(0) from the interrupt handler. The os.Exit(0) itself is also dubious, as conceptually the interrupt handler should ask whatever is stalling to stop stalling, so that main() can return like normal. Exiting with -1 in error cases where the interrupt handler is unable to stop the stall is fine, but the normal case of interrupting cleanly should exit through main().
The server may stay alive for quite a while because it waits for open connections to close before shutting down. We should find ways to terminate these connections in a more controlled manner, but in the meantime it's helpful to be able to see why a shutdown of the ipfs daemon is taking so long.
When closing a node, the node itself only takes care of tearing down its own children. As corehttp sets up a server based on a node, it needs to also ensure that the server is accounted for when determining if the node has been fully closed.
Once the server is asked to shut down, we stop accepting new connections, but the 'manners' graceful shutdown will wait for all existing connections to close before finishing. For keep-alive connections this will never happen unless the client detects that the server is shutting down through the ipfs API itself, and closes the connection in response. This is a problem e.g. with the webui's connections visualization, which polls the swarm/peers endpoint once a second and never detects that the API server was shut down. We can mitigate this by telling the server to disable keep-alive, which will add a 'Connection: close' header to the next HTTP response on the connection. A well-behaved client should then respond by closing the connection. Unfortunately this doesn't happen immediately in all cases, presumably depending on the keep-alive timeout of the browser that set up the connection, but it's at least a step in the right direction.
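To illustrate the mitigation, a rough sketch using only the standard library's net/http server (the daemon itself goes through the 'manners' graceful-shutdown wrapper, and the address and handler here are just placeholders):

package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
)

func main() {
	http.HandleFunc("/api/v0/swarm/peers", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"Peers":[]}`)
	})
	srv := &http.Server{Addr: "127.0.0.1:5001"}

	go func() {
		// On interrupt, stop offering keep-alive: each idle connection's next
		// response carries "Connection: close", nudging well-behaved clients
		// (such as a polling webui) to drop the connection.
		c := make(chan os.Signal, 1)
		signal.Notify(c, os.Interrupt)
		<-c
		srv.SetKeepAlivesEnabled(false)
		fmt.Println("keep-alives disabled, waiting for clients to disconnect")
	}()

	srv.ListenAndServe()
}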
If a command invocation such as 'daemon' is interrupted, the interrupt handler asks the node to close. The closing of the node will result in the command invocation finishing, and possibly returning from main() before the interrupt handler is done. In particular, the info logging that a graceful shutdown was completed may never reach stdout. As the whole point of logging "Gracefully shut down." is to give confidence when debugging that the shutdown was clean, this is slightly unfortunate. The interrupt handler needs to be set up in main() instead of Run(), so that we can defer the closing of the interrupt handler until just before returning from main(), not when Run() returns with a streaming result reader.
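The shape of that change, as a sketch with hypothetical names (the PR's actual type is the IntrHandler visible in the diff shown earlier, and its wiring differs):

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// intrHandler is a hypothetical stand-in for the handler the PR introduces:
// Close stops signal delivery and waits for the handler goroutine to finish,
// so its final log line is written before main() returns.
type intrHandler struct {
	sig  chan os.Signal
	done chan struct{}
}

func setupInterruptHandler() *intrHandler {
	ih := &intrHandler{
		sig:  make(chan os.Signal, 1),
		done: make(chan struct{}),
	}
	signal.Notify(ih.sig, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		defer close(ih.done)
		if _, ok := <-ih.sig; ok {
			// ... cancel the context / close the node here ...
			fmt.Println("Gracefully shut down.")
		}
	}()
	return ih
}

func (ih *intrHandler) Close() error {
	signal.Stop(ih.sig) // no further signals will be delivered to ih.sig
	close(ih.sig)       // unblocks the goroutine if no signal ever arrived
	<-ih.done           // wait for in-flight handling (and its logging) to finish
	return nil
}

func main() {
	intrh := setupInterruptHandler()
	defer intrh.Close() // deferred in main(), not in Run()

	// ... run the command invocation, possibly streaming results ...
}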
The context passed on from main() may change before we hit callCommand, so setting it in Parse is a bit optimistic.
The context may be cancelled while a request is in flight. We need to handle this and cancel the request. The code is based on the ideas from https://blog.golang.org/context
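A minimal sketch of that pattern, written against today's context-aware client API rather than the code the PR touches (the URL is only a placeholder for the daemon's API):

package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Cancel the context while the request may still be in flight; the HTTP
	// client aborts the request and returns a context error.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://127.0.0.1:5001/api/v0/version", nil)
	if err != nil {
		fmt.Println("building request:", err)
		return
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request ended:", err) // e.g. context deadline exceeded / canceled
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}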
When the response includes the X-Chunked-Output header, we treat that as channel output, and fire up a goroutine to decode the chunks. This routine needs to look for context cancellation so that it can exit cleanly.
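Roughly the shape of such a decode loop, as a standalone sketch (the real client decodes the API's chunked responses; the JSON-lines input here is just for illustration):

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"strings"
	"time"
)

// decodeChunks reads a stream of JSON values from r and sends them on out,
// stopping early if ctx is cancelled so the goroutine exits cleanly.
func decodeChunks(ctx context.Context, r io.Reader, out chan<- map[string]interface{}) {
	defer close(out)
	dec := json.NewDecoder(r)
	for {
		var v map[string]interface{}
		if err := dec.Decode(&v); err != nil {
			return // io.EOF or a real error; either way stop decoding
		}
		select {
		case out <- v:
		case <-ctx.Done():
			return // caller went away; don't block forever on the send
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	body := strings.NewReader(`{"n":1}` + "\n" + `{"n":2}` + "\n")
	out := make(chan map[string]interface{})
	go decodeChunks(ctx, body, out)

	for v := range out {
		fmt.Println(v)
	}
}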
Instead of assuming the command is the daemon command and closing the node, which resulted in bugs like ipfs#1053, we cancel the context and let the context children detect the cancellation and gracefully clean up after themselves. The shutdown logging has been moved into the daemon command, where it makes more sense, so that commands like ping will not print out the same output on cancellation.
Instead of just terminating right there and then, we cancel the context, and let the daemon exit cleanly. This may take a few seconds, as the node builder and its child processes do not care too much about the context state while building nodes, but this can be improved by injecting checks for ctx.Done() before time-consuming steps.
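A sketch of the kind of check being described (hypothetical helper and step names; the real node builder's steps are different):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkCtx lets long build sequences bail out early once the context is cancelled.
func checkCtx(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

func buildNode(ctx context.Context) error {
	steps := []func() error{
		func() error { time.Sleep(50 * time.Millisecond); return nil }, // e.g. open the repo
		func() error { time.Sleep(50 * time.Millisecond); return nil }, // e.g. set up the network
	}
	for _, step := range steps {
		if err := checkCtx(ctx); err != nil {
			return fmt.Errorf("node construction aborted: %w", err)
		}
		if err := step(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate Ctrl+C arriving during startup
	if err := buildNode(ctx); err != nil && errors.Is(err, context.Canceled) {
		fmt.Println("daemon exited early:", err)
	}
}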
(force-pushed from ef409c0 to bfd1211)
Added comments and rebased.
@torarnv this is now in master. thank you very much!
Awesome! 🎉 Let me know if there are regressions the testsuite didn't catch, or similar issues/related areas that I could tackle as an extension of this.
@torarnv no issues so far. so-- related issues are daemon init + startup. there's enough issues around this to warrant a whole sprint: https://github.com/ipfs/go-ipfs/labels/daemon%20%2B%20init may be worth picking small ones of those or scoping out what changes need to happen (there's plenty of cmdslib crap to do too there-- cc @whyrusleeping )
Fixes in my quest to make the daemon shut down cleanly on Ctrl+C. Having the daemon stuck, especially without any logging/info to the user about why it's stuck, gives a bad impression to new users, as it's not evident whether pressing Ctrl+C a second time might interrupt an important task and corrupt something.
Spawned from ipfs/ipfs-webui#28