-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Master babysit retry #5036
Master babysit retry #5036
Conversation
Even if they're dying, we want to keep trying - for any temporary condition, this is better than waiting forever.
This helps with log monitoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay, one nit in the test update. I wonder if we can test the new behaviour somehow.
xlog $self, "check that the error was syslogged"; | ||
my @loglines = $self->{instance}->getsyslog(); | ||
$self->assert(grep { m/too many failures for service/ } @loglines); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will give false failures if the syslog replacement injection had failed. Have a look at $self->assert_syslog_matches()
... I can't remember if it's exactly suitable here, but if it's not, at least the implementation will show the right way to handle this sort of thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, can probably just use that :) Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this!
We can test new behaviour by hand by killing something every second and watching the logs
I recently moved sync_client to be managed by Cyrus rather than a separate "monitorsync" script.
Likewise, we have other things started under the DAEMON block which are supposed to run forever at Fastmail:
pusher, calalarmd, squatter, idled, bayesrpcd and one day again: promstatd.
This change means that if we hit the restart limit, we retry every 10 seconds forever for anything with the 'babysit' flag - which includes all daemons. We only try once per 10 seconds unless we get a clean 10 seconds, in which case it resets back to allowing 5 failures with immediate restart. This should stop it causing a totally crazy amount of restarts.
Finally: I added the string "ERROR" to the messages about hitting the limits, which will help with Fastmail's log tracking.