This repository has been archived by the owner on Nov 14, 2023. It is now read-only.

Graceful shutdown #517

Closed
wants to merge 7 commits into from

Conversation

jbygdell
Collaborator

  • Handle interrupt signals from the OS (a minimal sketch of the pattern follows below).
  • Remove all log.Fatal calls to make sure we exit cleanly in all cases.
  • Linter fixes: return with no blank line before (nlreturn).
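
For context, a minimal sketch of the interrupt-handling pattern described above. This is an illustration only: it assumes logrus-style logging (matching the log.* calls discussed in this thread) and leaves the actual MQ/DB cleanup as a placeholder; the wiring in this PR may differ.

package main

import (
	"os"
	"os/signal"
	"syscall"

	log "github.com/sirupsen/logrus"
)

func main() {
	// Buffered so a signal sent before the goroutine is ready is not lost.
	sigc := make(chan os.Signal, 5)
	signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM, syscall.SIGHUP, syscall.SIGQUIT)

	go func() {
		sig := <-sigc
		log.Infof("received signal %v, shutting down", sig)
		// Placeholder: close MQ and DB connections here before exiting.
		os.Exit(1)
	}()

	// ... service initialization and main work loop ...
}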

@jbygdell jbygdell requested a review from a team January 24, 2023 12:40
@jbygdell jbygdell self-assigned this Jan 24, 2023
@pontus
Contributor

pontus commented Jan 24, 2023

Looks good.

Some possible considerations:

  • Would it make sense to log when exiting due to signal received (even though the "internal" signals might be confusing)?
  • cmd/api/api.go has a few log.Fatalln - should those also be replaced?
  • Maybe the signal handler is called enough that it's worth making it shared (presumably being passed mq, db, or possibly nil if either isn't used)?

@jbygdell
Collaborator Author

Looks good.

Some possible considerations:

  • Would it make sense to log when exiting due to signal received (even though the "internal" signals might be confusing)?
  • cmd/api/api.go has a few log.Fatalln - should those also be replaced?

Since we call shutdown() before calling log.Fatalln, we ensure that the DB and MQ connections are closed cleanly.

  • Maybe the signal handler is called enough that it's worth making it shared (presumably being passed mq, db, or possibly nil if either isn't used)?

pontus
pontus previously requested changes Jan 26, 2023
Contributor

@pontus pontus left a comment


Now that I actually managed to think: (generally) execution needs to be paused in the problematic thread as well.

Otherwise we risk e.g. failing to initialize configuration, noting that and requesting shutdown, but still continuing initialisation and getting a panic from using a nil value.

@jbygdell
Collaborator Author

Now that I actually managed to think: (generally) execution needs to be paused in the problematic thread as well.

Otherwise we risk e.g. failing to initialize configuration, noting that and requesting shutdown, but still continuing initialisation and getting a panic from using a nil value.

Replacing the shutdown() and log.Fatalf() lines in api.go with a log.Errorf() and a SIGINT call will not change anything, since everything is already initialized at this point.

@pontus
Contributor

pontus commented Jan 26, 2023

Replacing the shutdown() and log.Fatalf() lines in api.go with a log.Errorf() and a SIGINT call will not change anything, since everything is already initialized at this point.

Yes, that's more of a style/consistency issue, whereas:

Now that I actually managed to think: (generally) execution needs to be paused in the problematic thread as well.
Otherwise we risk e.g. failing to initialize configuration, noting that and requesting shutdown, but still continuing initialisation and getting a panic from using a nil value.

That is a parallelization issue: it certainly seems plausible that, after signalling the shutdown (the channel has room), the thread where the error occurs goes on with the next thing and panics, because previous initialization has failed, before the goroutine responsible for shutdown gets a chance to read from the channel.

@jbygdell
Collaborator Author

Now it is nil-pointer safe as well.

@codecov-commenter

codecov-commenter commented Jan 26, 2023

Codecov Report

Merging #517 (ee09417) into master (ac74f63) will decrease coverage by 1.84%.
The diff coverage is 3.04%.


@@            Coverage Diff             @@
##           master     #517      +/-   ##
==========================================
- Coverage   40.48%   38.64%   -1.84%     
==========================================
  Files          13       13              
  Lines        3048     3206     +158     
==========================================
+ Hits         1234     1239       +5     
- Misses       1771     1924     +153     
  Partials       43       43              
Flag Coverage Δ
unittests 38.64% <3.04%> (-1.84%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
cmd/backup/backup.go 0.00% <0.00%> (ø)
cmd/finalize/finalize.go 0.00% <0.00%> (ø)
cmd/intercept/intercept.go 14.38% <0.00%> (-1.53%) ⬇️
cmd/mapper/mapper.go 0.00% <0.00%> (ø)
cmd/notify/notify.go 51.04% <0.00%> (-6.89%) ⬇️
cmd/verify/verify.go 0.00% <0.00%> (ø)
cmd/ingest/ingest.go 3.24% <1.81%> (-0.25%) ⬇️
internal/broker/broker.go 82.98% <5.26%> (-5.86%) ⬇️
cmd/api/api.go 58.25% <50.00%> (-0.09%) ⬇️
... and 6 more


@jbygdell jbygdell requested a review from pontus January 26, 2023 13:50
@pontus
Contributor

pontus commented Jan 26, 2023

I'm not sure if it's unclear, but the problem I see (#517 (review)) is a race condition; in short, I think all instances where there are problems should be e.g.

	log.Error(err)
	sigc <- syscall.SIGINT
	<-make(chan bool, 1)

Obviously, that requires moving at least the goroutine that does the actual shutdown up top.

It "probably" works anyway, but the way I see it, that (or some other way of holding execution in the problematic flow) is needed for correctness.

(It may of course be that I'm wrong, but then it'd be nice to understand how.)
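
In other words, the suggested shape is roughly the following (a hypothetical fragment for illustration, not the PR's actual code; sigc is the signal channel, and the shutdown goroutine is assumed to be registered before any initialization that can fail):

// Registered up top, before anything that can fail.
sigc := make(chan os.Signal, 5)
signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM)
go func() {
	<-sigc
	// Placeholder: close MQ and DB connections here.
	os.Exit(1)
}()

// At an error site:
if err != nil {
	log.Error(err)
	sigc <- syscall.SIGINT
	// Park this goroutine so it cannot continue with half-initialized state
	// while the shutdown goroutine drains the channel and exits.
	<-make(chan bool, 1)
}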

@jbygdell
Collaborator Author

I'm not sure if it's unclear, but the problem I see (#517 (review)) is a race condition; in short, I think all instances where there are problems should be e.g.

	log.Error(err)
	sigc <- syscall.SIGINT
	<-make(chan bool, 1)

Obviously, that requires moving at least the goroutine that does the actual shutdown up top.

It "probably" works anyway, but the way I see it, that (or some other way of holding execution in the problematic flow) is needed for correctness.

(It may of course be that I'm wrong, but then it'd be nice to understand how.)

The removal of log.Fatal() is to make sure that we have time to shut down cleanly, since it is the equivalent of doing log.Error() directly followed by os.Exit(1).

Utilizing syscall.SIGINT is just a convenient way that also captures external shutdown requests; another way is to call a shutdown() function, which would have to look something like this:

func shutdown(mq *broker.AMQPBroker, db *database.SQLdb) {
	// Close directly rather than via defer: deferred calls never run once os.Exit is reached.
	if mq != nil {
		mq.Channel.Close()
		mq.Connection.Close()
	}
	if db != nil {
		db.Close()
	}
	os.Exit(1)
}

Given that all of this happens before the main work function is started, it is pretty safe as is.

@pontus pontus dismissed their stale review January 31, 2023 08:03

I don't think this would normally break in practice, so I don't want to hold it up, but I still do not consider it correct enough that I want to approve it.

@jbygdell jbygdell requested a review from a team January 31, 2023 08:35
@aaperis
Contributor

aaperis commented Jan 31, 2023

The changes look reasonable to me, but I would like to see some test cases, preferably in the form of code. This would help give a better picture of what to expect, given that we would like the pipeline's behavior to be as predictable as possible.

@jbygdell
Collaborator Author

jbygdell commented Feb 2, 2023

The changes look reasonable to me, but I would like to see some test cases, preferably in the form of code. This would help give a better picture of what to expect, given that we would like the pipeline's behavior to be as predictable as possible.

It is hard to have a coded test case that simulates an external kill -2 PID, especially since we can't test the main function as it is written now.
The closest we can get is to stop the container in the integration test and verify that we don't see the error client unexpectedly closed TCP connection in the DB or MQ logs.

@jbygdell jbygdell requested a review from a team February 9, 2023 13:18
Contributor

@blankdots blankdots left a comment


Does this fix #412? If not, I suspect we might need to rewrite some of the MQ graceful shutdown for that as well.
There is a mention here: https://github.com/rabbitmq/amqp091-go/blob/main/doc.go#L107-L146
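
For reference, one way to notice an abnormal MQ connection loss with amqp091-go is to listen on NotifyClose and feed the result into the same graceful-shutdown path. This is a hedged sketch only, with illustrative names and URI, assuming the package is imported as amqp from github.com/rabbitmq/amqp091-go; it is not necessarily what the linked documentation or the pipeline does:

conn, err := amqp.Dial("amqps://user:pass@mq:5671/") // illustrative URI
if err != nil {
	log.Error(err)
	return
}
// NotifyClose delivers an *amqp.Error when the connection closes;
// a nil value means the connection was shut down cleanly.
closed := conn.NotifyClose(make(chan *amqp.Error, 1))
go func() {
	if closeErr := <-closed; closeErr != nil {
		log.Errorf("MQ connection closed unexpectedly: %v", closeErr)
		// Trigger the same graceful-shutdown path, e.g. sigc <- syscall.SIGINT.
	}
}()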

@pontus
Contributor

pontus commented Feb 9, 2023

I don't see that this should have any bearing on that - but then again, the suggested thing is already done, and if an error is detected, the service bails out, so I'm not sure what happens when that stall occurs.

@jbygdell
Collaborator Author

jbygdell commented Feb 9, 2023

Should be an easy fix since we already have the functionality for connection errors.

@blankdots
Contributor

I did encounter it a couple of times, but was not able to reproduce it at all consistently.

rabbitmq/amqp091-go#123 is the only related thing I have about it.

@blankdots
Contributor

The error looks like: streadway/amqp#518

@pontus pontus mentioned this pull request Feb 10, 2023
2 tasks
@jbygdell
Collaborator Author

Rebased on main

@jbygdell jbygdell marked this pull request as draft April 21, 2023 10:56
@jbygdell
Collaborator Author

Converting to draft since there probably will not be time to implement it before we merge the repos.

@jbygdell
Collaborator Author

Will be implemented in the merged repo.

@jbygdell jbygdell closed this Aug 14, 2023