Master babysit retry #5036

brong · 2024-09-15T11:19:53Z

I recently moved sync_client to be managed by Cyrus rather than a separate "monitorsync" script.

Likewise, at Fastmail we have other things started under the DAEMON block which are supposed to run forever: pusher, calalarmd, squatter, idled, bayesrpcd and, one day again, promstatd.

This change means that if we hit the restart limit, anything with the 'babysit' flag (which includes all daemons) is retried every 10 seconds, forever. Restarts stay limited to one per 10 seconds until the process gets a clean 10 seconds, at which point the counter resets and up to 5 failures are again restarted immediately. This should stop a failing daemon from causing a runaway number of restarts.
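To make the retry policy described above concrete, here is a minimal sketch in C. It is not the code from this PR: the names `babysit_state`, `restart_delay`, `MAX_FAILS` and `RETRY_INTERVAL` are hypothetical, and it assumes that a "clean 10 seconds" means the child ran for at least 10 seconds before failing, and that the sixth consecutive failure is the first one to be throttled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants taken from the numbers in the description. */
#define MAX_FAILS      5   /* failures that still get an immediate restart */
#define RETRY_INTERVAL 10  /* seconds between throttled restarts; also the
                              length of a "clean" run that resets the count */

/* Hypothetical per-service bookkeeping; not the struct used by master. */
struct babysit_state {
    int nfails;  /* failures since the last clean run */
};

/*
 * Called when a babysat child has failed after running for `uptime`
 * seconds. Returns how many seconds to wait before starting it again:
 * 0 for an immediate restart, RETRY_INTERVAL once the limit is exceeded.
 * The service is never disabled; it keeps being retried.
 */
int restart_delay(struct babysit_state *st, long uptime)
{
    assert(st != NULL);

    /* A clean run wipes the failure history (assumed interpretation). */
    if (uptime >= RETRY_INTERVAL)
        st->nfails = 0;

    st->nfails++;

    if (st->nfails <= MAX_FAILS)
        return 0;              /* still within the immediate-restart budget */

    return RETRY_INTERVAL;     /* over the limit: one attempt per interval */
}
```

Under these assumptions, a child that dies instantly every time gets five immediate restarts and is then retried once every 10 seconds. As soon as one run lasts 10 seconds or more, the next failure is treated as the first of a fresh budget of five.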

Finally: I added the string "ERROR" to the messages about hitting the limits, which will help with Fastmail's log tracking.

elliefm

Looks okay, one nit in the test update. I wonder if we can test the new behaviour somehow.

elliefm · 2024-09-16T00:25:45Z

cassandane/Cassandane/Cyrus/Master.pm

+    xlog $self, "check that the error was syslogged";
+    my @loglines = $self->{instance}->getsyslog();
+    $self->assert(grep { m/too many failures for service/ } @loglines);


This will give false failures if the syslog replacement injection failed. Have a look at $self->assert_syslog_matches(). I can't remember if it's exactly suitable here, but if it's not, at least its implementation will show the right way to handle this sort of thing.

Yes, can probably just use that :) Thanks

I added this!

We can test the new behaviour by hand by killing something every second and watching the logs.

brong added 2 commits September 15, 2024 21:15

master: for "daemon" services, try again every 10 seconds

a7a6629

Even if they're dying, we want to keep trying - for any temporary condition, this is better than waiting forever.

master: add the 'ERROR' string to multiple failure errors

e6a4175

This helps with log monitoring

brong requested a review from elliefm September 15, 2024 11:19

brong added 2 commits September 15, 2024 21:21

add a changes file for the master babysit change

72a8771

Master: check for syslog error upon failure

1075fe6

brong added the include-in-fastmail label Sep 15, 2024

elliefm reviewed Sep 16, 2024

Master: handle case where we don't have syslog replacement

f573bec

brong requested a review from elliefm September 16, 2024 02:01

elliefm approved these changes Sep 16, 2024

ksmurchison approved these changes Sep 16, 2024

brong merged commit 9152608 into cyrusimap:master Sep 23, 2024
1 check passed
