
Refactor alerts to distinguish between infrastructure failure, capacity, or service limits #1458

Open
1 task
mohdnr opened this issue Feb 8, 2022 · 5 comments
Labels
Tech Debt: Tasks to reduce technical debt

Comments

@mohdnr
Contributor

mohdnr commented Feb 8, 2022

Candidates for refactoring:

  • logs-10-celery-error-1-minute-critical: This currently tracks "?\"ERROR/Worker\" ?\"ERROR/ForkPoolWorker\" ?\"WorkerLostError\"" in the CloudWatch eks-cluster/application logs. This is too generic. Identify a way to distinguish between intentionally thrown errors (e.g. message limits) and legitimate Celery failures. A sketch of the current filter definition follows below.
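For reference, a minimal sketch of how this filter might be expressed with boto3; the log group path, metric name, and namespace below are assumptions for illustration only:

import boto3

logs = boto3.client("logs")

# The three "?" terms are OR'd together, so any one of them counts toward the metric.
logs.put_metric_filter(
    logGroupName="/aws/containerinsights/eks-cluster/application",  # assumed log group path
    filterName="logs-10-celery-error-1-minute-critical",
    filterPattern='?"ERROR/Worker" ?"ERROR/ForkPoolWorker" ?"WorkerLostError"',
    metricTransformations=[{
        "metricName": "celery-error",      # assumed metric name
        "metricNamespace": "LogMetrics",   # assumed namespace
        "metricValue": "1",
    }],
)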
@mohdnr mohdnr added the Tech Debt label on Feb 8, 2022
@jimleroyer
Member

I'd like to know more about what you have in mind, but I'm unsure that the described filters are problematic or too generic: they don't target infrastructure, capacity, or service limits, but rather tell us when a worker fails. For example, when an email could not be sent due to a job failure.

@mohdnr
Contributor Author

mohdnr commented Feb 9, 2022

@jimleroyer the alert generated by this alarm was deemed to be caused by a service exceeding its limit. I was curious to investigate the relationship between that happening and this alarm triggering.

@jimleroyer
Member

jimleroyer commented Feb 10, 2022

Oh true, I reviewed some of the alerts from the past week and realized the service rate limit did indeed trigger a Celery worker alert. Let's refine it so that, at a minimum, the rate limit goes into its own alert (probably warning level) and doesn't trigger a Celery worker alert.
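For illustration, a minimal sketch of routing rate-limit errors to their own warning-level metric; "TooManyRequestsError" is a hypothetical placeholder for whatever marker those log lines actually contain, and the other names are assumptions as well:

import boto3

logs = boto3.client("logs")

# Count rate-limit errors separately so they can feed a warning-level alarm
# instead of the critical Celery worker alarm.
logs.put_metric_filter(
    logGroupName="/aws/containerinsights/eks-cluster/application",  # assumed log group path
    filterName="logs-service-rate-limit-warning",                   # hypothetical filter name
    filterPattern='"TooManyRequestsError"',                         # placeholder marker
    metricTransformations=[{
        "metricName": "service-rate-limit",  # assumed metric name
        "metricNamespace": "LogMetrics",     # assumed namespace
        "metricValue": "1",
    }],
)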

@patheard
Member

patheard commented Apr 25, 2022

Another one that could be filtered out is the psycopg2.errors.UniqueViolation error, since a small number of those are expected.

Here’s what it would look like to stop psycopg2.errors.UniqueViolation errors from being counted by the celery-error filter:

[(err="*ERROR/Worker*" || err="*ERROR/ForkPoolWorker*" || err="*WorkerLostError*") && err!="*psycopg2.errors.UniqueViolation*"]

You could use this same pattern to tune out any Celery errors that are expected in low quantities.
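Before deploying a pattern change like this, one way to sanity-check it is CloudWatch Logs' TestMetricFilter API, which runs a pattern against sample log lines; the sample messages below are made up for illustration:

import boto3

logs = boto3.client("logs")

result = logs.test_metric_filter(
    filterPattern=(
        '[(err="*ERROR/Worker*" || err="*ERROR/ForkPoolWorker*" || err="*WorkerLostError*") '
        '&& err!="*psycopg2.errors.UniqueViolation*"]'
    ),
    logEventMessages=[
        "ERROR/ForkPoolWorker-1 Task failed with ConnectionError",      # made-up worker failure line
        "ERROR/Worker psycopg2.errors.UniqueViolation: duplicate key",  # made-up duplicate-key line
    ],
)

# Print which of the sample lines the pattern matches; the intent is that worker
# failures count while UniqueViolation lines do not.
for match in result["matches"]:
    print(match["eventMessage"])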

@jimleroyer
Member

@patheard We might want to keep an alarm on a high volume of unique violation / duplicate key errors. So we could have one alarm that filters these out completely, as you suggested, and a dedicated one for them with a proper threshold over a period of time.
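A minimal sketch of what that could look like, with a separate metric and a threshold-over-time alarm; all names, thresholds, and periods below are assumptions for illustration only:

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Count only the duplicate-key errors in their own metric.
logs.put_metric_filter(
    logGroupName="/aws/containerinsights/eks-cluster/application",  # assumed log group path
    filterName="logs-celery-unique-violation",                      # hypothetical filter name
    filterPattern='"psycopg2.errors.UniqueViolation"',
    metricTransformations=[{
        "metricName": "celery-unique-violation",  # assumed metric name
        "metricNamespace": "LogMetrics",          # assumed namespace
        "metricValue": "1",
    }],
)

# Alarm only when the count is unusually high over a sustained period.
cloudwatch.put_metric_alarm(
    AlarmName="celery-unique-violation-warning",  # hypothetical alarm name
    MetricName="celery-unique-violation",
    Namespace="LogMetrics",
    Statistic="Sum",
    Period=300,               # 5 minute buckets (assumed)
    EvaluationPeriods=3,      # sustained for 15 minutes (assumed)
    Threshold=25,             # placeholder threshold
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmDescription="Warn when duplicate-key errors spike beyond the expected baseline",
)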
