Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Master babysit retry #5036

Merged
merged 5 commits into from
Sep 23, 2024
Merged

Master babysit retry #5036

merged 5 commits into from
Sep 23, 2024

Conversation

brong
Copy link
Member

@brong brong commented Sep 15, 2024

I recently moved sync_client to be managed by Cyrus rather than a separate "monitorsync" script.

Likewise, we have other things started under the DAEMON block which are supposed to run forever at Fastmail:

pusher, calalarmd, squatter, idled, bayesrpcd and one day again: promstatd.

This change means that if we hit the restart limit, we retry every 10 seconds forever for anything with the 'babysit' flag - which includes all daemons. We only try once per 10 seconds unless we get a clean 10 seconds, in which case it resets back to allowing 5 failures with immediate restart. This should stop it causing a totally crazy amount of restarts.

Finally: I added the string "ERROR" to the messages about hitting the limits, which will help with Fastmail's log tracking.

Even if they're dying, we want to keep trying - for any
temporary condition, this is better than waiting forever.
@brong brong requested a review from elliefm September 15, 2024 11:19
Copy link
Contributor

@elliefm elliefm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay, one nit in the test update. I wonder if we can test the new behaviour somehow.

Comment on lines 535 to 537
xlog $self, "check that the error was syslogged";
my @loglines = $self->{instance}->getsyslog();
$self->assert(grep { m/too many failures for service/ } @loglines);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will give false failures if the syslog replacement injection had failed. Have a look at $self->assert_syslog_matches()... I can't remember if it's exactly suitable here, but if it's not, at least the implementation will show the right way to handle this sort of thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, can probably just use that :) Thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this!

We can test new behaviour by hand by killing something every second and watching the logs

@brong brong requested a review from elliefm September 16, 2024 02:01
@brong brong merged commit 9152608 into cyrusimap:master Sep 23, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants