This repository has been archived by the owner on Nov 14, 2023. It is now read-only.

Graceful shutdown #517

Closed
wants to merge 7 commits into from

Conversation

jbygdell
Collaborator

  • Handle interrupt signals from the OS (a minimal sketch of the pattern follows below).
  • Remove all log.Fatal calls to make sure we exit cleanly in all cases.
  • Linter fixes: return with no blank line before (nlreturn).
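
For context, a minimal sketch of the interrupt-handling pattern described above. This is an illustration only: it assumes logrus-style logging (matching the log.* calls discussed in this thread) and leaves the actual MQ/DB cleanup as a placeholder; the wiring in this PR may differ.

package main

import (
	"os"
	"os/signal"
	"syscall"

	log "github.com/sirupsen/logrus"
)

func main() {
	// Buffered so a signal sent before the goroutine is ready is not lost.
	sigc := make(chan os.Signal, 5)
	signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM, syscall.SIGHUP, syscall.SIGQUIT)

	go func() {
		sig := <-sigc
		log.Infof("received signal %v, shutting down", sig)
		// Placeholder: close MQ and DB connections here before exiting.
		os.Exit(1)
	}()

	// ... service initialization and main work loop ...
}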

@jbygdell jbygdell requested a review from a team January 24, 2023 12:40
@jbygdell jbygdell self-assigned this Jan 24, 2023
@pontus
Contributor

pontus commented Jan 24, 2023

Looks good.

Some possible considerations:

  • Would it make sense to log when exiting due to signal received (even though the "internal" signals might be confusing)?
  • cmd/api/api.go has a few log.Fatalln - should those also be replaced?
  • Maybe the signal handler is called enough that it's worth making it shared (presumably being passed mq, db, or possibly nil if either isn't used)?

@jbygdell
Collaborator Author

Looks good.

Some possible considerations:

  • Would it make sense to log when exiting due to signal received (even though the "internal" signals might be confusing)?
  • cmd/api/api.go has a few log.Fatalln - should those also be replaced?

Since we call shutdown() before calling log.Fatalln, we ensure that the DB and MQ connections are closed cleanly.

  • Maybe the signal handler is called enough that it's worth making it shared (presumably being passed mq, db, or possibly nil if either isn't used)?

pontus
pontus previously requested changes Jan 26, 2023
Contributor

@pontus pontus left a comment


Now that I actually managed to think: (generally) execution needs to be paused in the problematic thread as well.

Otherwise we risk e.g. failing to initialize configuration, noting that and requesting shutdown, but still continuing initialisation and getting a panic from using a nil value.

@jbygdell
Collaborator Author

Now that I actually managed to think: (generally) execution needs to be paused in the problematic thread as well.

Otherwise we risk e.g. failing to initialize configuration, noting that and requesting shutdown, but still continuing initialisation and getting a panic from using a nil value.

Replacing the shutdown() and log.Fatalf() lines in api.go with a log.Errorf() and a SIGINT call will not change anything, since everything is already initialized at this point.

@pontus
Contributor

pontus commented Jan 26, 2023

Replacing the shutdown() and log.Fatalf() lines in api.go with a log.Errorf() and a SIGINT call will not change anything, since everything is already initialized at this point.

Yes, that's more of a style/consistency issue, whereas:

Now that I actually managed to think: (generally) execution needs to be paused in the problematic thread as well.
Otherwise we risk e.g. failing to initialize configuration, noting that and requesting shutdown, but still continuing initialisation and getting a panic from using a nil value.

That is a parallelization issue: it certainly seems plausible that, after signalling the shutdown (the channel has room), the thread where the error occurs goes on with the next thing and panics, because previous initialization has failed, before the goroutine responsible for shutdown gets a chance to read from the channel.

@jbygdell
Collaborator Author

Now it is nil-pointer safe as well.

@codecov-commenter

codecov-commenter commented Jan 26, 2023

Codecov Report

Merging #517 (ee09417) into master (ac74f63) will decrease coverage by 1.84%.
The diff coverage is 3.04%.


@@            Coverage Diff             @@
##           master     #517      +/-   ##
==========================================
- Coverage   40.48%   38.64%   -1.84%     
==========================================
  Files          13       13              
  Lines        3048     3206     +158     
==========================================
+ Hits         1234     1239       +5     
- Misses       1771     1924     +153     
  Partials       43       43              
Flag Coverage Δ
unittests 38.64% <3.04%> (-1.84%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
cmd/backup/backup.go 0.00% <0.00%> (ø)
cmd/finalize/finalize.go 0.00% <0.00%> (ø)
cmd/intercept/intercept.go 14.38% <0.00%> (-1.53%) ⬇️
cmd/mapper/mapper.go 0.00% <0.00%> (ø)
cmd/notify/notify.go 51.04% <0.00%> (-6.89%) ⬇️
cmd/verify/verify.go 0.00% <0.00%> (ø)
cmd/ingest/ingest.go 3.24% <1.81%> (-0.25%) ⬇️
internal/broker/broker.go 82.98% <5.26%> (-5.86%) ⬇️
cmd/api/api.go 58.25% <50.00%> (-0.09%) ⬇️
... and 6 more


@jbygdell jbygdell requested a review from pontus January 26, 2023 13:50
@pontus
Contributor

pontus commented Jan 26, 2023

I'm not sure if it's unclear, but the problem I see (#517 (review)) is a race condition; in short, I think all instances where there are problems should be e.g.

	log.Error(err)
	sigc <- syscall.SIGINT
	<-make(chan bool, 1)

Obviously, that requires moving at least the goroutine that does the actual shutdown up top.

It "probably" works anyway, but the way I see it, that (or some other way of holding execution in the problematic flow) is needed for correctness.

(It may of course be that I'm wrong, but then it'd be nice to understand how.)
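
In other words, the suggested shape is roughly the following (a hypothetical fragment for illustration, not the PR's actual code; sigc is the signal channel, and the shutdown goroutine is assumed to be registered before any initialization that can fail):

// Registered up top, before anything that can fail.
sigc := make(chan os.Signal, 5)
signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM)
go func() {
	<-sigc
	// Placeholder: close MQ and DB connections here.
	os.Exit(1)
}()

// At an error site:
if err != nil {
	log.Error(err)
	sigc <- syscall.SIGINT
	// Park this goroutine so it cannot continue with half-initialized state
	// while the shutdown goroutine drains the channel and exits.
	<-make(chan bool, 1)
}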

@jbygdell
Collaborator Author

I'm not sure if it's unclear, but the problem I see (#517 (review)) is a race condition; in short, I think all instances where there are problems should be e.g.

	log.Error(err)
	sigc <- syscall.SIGINT
	<-make(chan bool, 1)

Obviously, that requires moving at least the goroutine that does the actual shutdown up top.

It "probably" works anyway, but the way I see it, that (or some other way of holding execution in the problematic flow) is needed for correctness.

(It may of course be that I'm wrong, but then it'd be nice to understand how.)

The removal of log.Fatal() is to make sure that we have time to shut down cleanly, since it is the equivalent of doing log.Error() directly followed by os.Exit(1).

Utilizing syscall.SIGINT is just a convenient way that also captures external shutdown requests; another way is to call a shutdown() function, which would have to look something like this:

func shutdown(mq *broker.AMQPBroker, db *database.SQLdb) {
	// Close directly rather than via defer: deferred calls never run once os.Exit is reached.
	if mq != nil {
		mq.Channel.Close()
		mq.Connection.Close()
	}
	if db != nil {
		db.Close()
	}
	os.Exit(1)
}

Given that all of this happens before the main work function is started, it is pretty safe as is.

@pontus pontus dismissed their stale review January 31, 2023 08:03

I don't think this would normally break in practice, so I don't want to hold it up, but I still do not consider it correct enough that I want to approve it.

@jbygdell jbygdell requested a review from a team January 31, 2023 08:35
@aaperis
Contributor

aaperis commented Jan 31, 2023

The changes look reasonable to me, but I would like to see some test cases, preferably in the form of code. This would help give a better picture of what to expect, given that we would like the pipeline's behavior to be as predictable as possible.

@jbygdell
Collaborator Author

jbygdell commented Feb 2, 2023

The changes look reasonable to me, but I would like to see some test cases, preferably in the form of code. This would help give a better picture of what to expect, given that we would like the pipeline's behavior to be as predictable as possible.

It is hard to have a coded test case that simulates an external kill -2 PID, especially since we can't test the main function as it is written now.
The closest we can get is to stop the container in the integration test and verify that we don't see the error client unexpectedly closed TCP connection in the DB or MQ logs.

@jbygdell jbygdell requested a review from a team February 9, 2023 13:18
Contributor

@blankdots blankdots left a comment


Does this fix #412? If not, I suspect we might need to rewrite some of the MQ graceful shutdown for that as well.
There is a mention here: https://github.com/rabbitmq/amqp091-go/blob/main/doc.go#L107-L146
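
For reference, one way to notice an abnormal MQ connection loss with amqp091-go is to listen on NotifyClose and feed the result into the same graceful-shutdown path. This is a hedged sketch only, with illustrative names and URI, assuming the package is imported as amqp from github.com/rabbitmq/amqp091-go; it is not necessarily what the linked documentation or the pipeline does:

conn, err := amqp.Dial("amqps://user:pass@mq:5671/") // illustrative URI
if err != nil {
	log.Error(err)
	return
}
// NotifyClose delivers an *amqp.Error when the connection closes;
// a nil value means the connection was shut down cleanly.
closed := conn.NotifyClose(make(chan *amqp.Error, 1))
go func() {
	if closeErr := <-closed; closeErr != nil {
		log.Errorf("MQ connection closed unexpectedly: %v", closeErr)
		// Trigger the same graceful-shutdown path, e.g. sigc <- syscall.SIGINT.
	}
}()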

@pontus
Contributor

pontus commented Feb 9, 2023

I don't see that this should have any bearing on that - but then again, the suggested thing is already done, and if an error is detected, the service bails out, so I'm not sure what happens when that stall occurs.

@jbygdell
Collaborator Author

jbygdell commented Feb 9, 2023

Should be an easy fix since we already have the functionality for connection errors.

@blankdots
Contributor

I did encounter it a couple of times, but was not able to reproduce it at all consistently.

rabbitmq/amqp091-go#123 is the only related thing I have about it.

@blankdots
Contributor

The error looks like: streadway/amqp#518

@pontus pontus mentioned this pull request Feb 10, 2023
2 tasks
@jbygdell
Collaborator Author

Rebased on main

@jbygdell jbygdell marked this pull request as draft April 21, 2023 10:56
@jbygdell
Collaborator Author

Converting to draft since there probably will not be time to implement it before we merge the repos.

@jbygdell
Collaborator Author

Will be implemented in the merged repo.

@jbygdell jbygdell closed this Aug 14, 2023