
Refactor alerts to distinguish between infrastructure failure, capacity, or service limits #1458

Open
1 task
mohdnr opened this issue Feb 8, 2022 · 5 comments
Labels
Tech Debt: Tasks to reduce technical debt

Comments

@mohdnr
Contributor

mohdnr commented Feb 8, 2022

Candidates for refactoring:

  • logs-10-celery-error-1-minute-critical: This currently tracks "?\"ERROR/Worker\" ?\"ERROR/ForkPoolWorker\" ?\"WorkerLostError\"" in the CloudWatch eks-cluster/application logs. This is too generic. Identify a way to distinguish between intentionally thrown errors (e.g. message limits) and legitimate Celery failures. A sketch of the current filter definition follows below.
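For reference, a minimal sketch of how this filter might be expressed with boto3; the log group path, metric name, and namespace below are assumptions for illustration only:

import boto3

logs = boto3.client("logs")

# The three "?" terms are OR'd together, so any one of them counts toward the metric.
logs.put_metric_filter(
    logGroupName="/aws/containerinsights/eks-cluster/application",  # assumed log group path
    filterName="logs-10-celery-error-1-minute-critical",
    filterPattern='?"ERROR/Worker" ?"ERROR/ForkPoolWorker" ?"WorkerLostError"',
    metricTransformations=[{
        "metricName": "celery-error",      # assumed metric name
        "metricNamespace": "LogMetrics",   # assumed namespace
        "metricValue": "1",
    }],
)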
@mohdnr mohdnr added the Tech Debt label on Feb 8, 2022
@jimleroyer
Member

I'd like to know more about what you have in mind, but I'm unsure that the described filters are problematic or too generic: they don't target infrastructure, capacity, or service limits, but rather tell us when a worker fails. For example, when an email could not be sent due to a job failure.

@mohdnr
Contributor Author

mohdnr commented Feb 9, 2022

@jimleroyer the alert generated by this alarm was deemed to be caused by a service exceeding its limit. I was curious to investigate the relationship between that happening and this alarm triggering.

@jimleroyer
Member

jimleroyer commented Feb 10, 2022

Oh true, I reviewed some of the alerts from the past week and realized the service rate limit did indeed trigger a Celery worker alert. Let's refine it so that, at a minimum, the rate limit goes into its own alert (probably warning level) and doesn't trigger a Celery worker alert.
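For illustration, a minimal sketch of routing rate-limit errors to their own warning-level metric; "TooManyRequestsError" is a hypothetical placeholder for whatever marker those log lines actually contain, and the other names are assumptions as well:

import boto3

logs = boto3.client("logs")

# Count rate-limit errors separately so they can feed a warning-level alarm
# instead of the critical Celery worker alarm.
logs.put_metric_filter(
    logGroupName="/aws/containerinsights/eks-cluster/application",  # assumed log group path
    filterName="logs-service-rate-limit-warning",                   # hypothetical filter name
    filterPattern='"TooManyRequestsError"',                         # placeholder marker
    metricTransformations=[{
        "metricName": "service-rate-limit",  # assumed metric name
        "metricNamespace": "LogMetrics",     # assumed namespace
        "metricValue": "1",
    }],
)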

@patheard
Member

patheard commented Apr 25, 2022

Another one that could be filtered out is the psycopg2.errors.UniqueViolation error, since a small number of those are expected.

Here’s what it would look like to stop psycopg2.errors.UniqueViolation errors from being counted by the celery-error filter:

[(err="*ERROR/Worker*" || err="*ERROR/ForkPoolWorker*" || err="*WorkerLostError*") && err!="*psycopg2.errors.UniqueViolation*"]

You could use this same pattern to tune out any Celery errors that are expected in low quantities.
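Before deploying a pattern change like this, one way to sanity-check it is CloudWatch Logs' TestMetricFilter API, which runs a pattern against sample log lines; the sample messages below are made up for illustration:

import boto3

logs = boto3.client("logs")

result = logs.test_metric_filter(
    filterPattern=(
        '[(err="*ERROR/Worker*" || err="*ERROR/ForkPoolWorker*" || err="*WorkerLostError*") '
        '&& err!="*psycopg2.errors.UniqueViolation*"]'
    ),
    logEventMessages=[
        "ERROR/ForkPoolWorker-1 Task failed with ConnectionError",      # made-up worker failure line
        "ERROR/Worker psycopg2.errors.UniqueViolation: duplicate key",  # made-up duplicate-key line
    ],
)

# Print which of the sample lines the pattern matches; the intent is that worker
# failures count while UniqueViolation lines do not.
for match in result["matches"]:
    print(match["eventMessage"])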

@jimleroyer
Member

@patheard We might want to keep an alarm on a high volume of unique violation / duplicate key errors. So we could have one alarm that filters these out completely, as you suggested, and a dedicated one for them with a proper threshold over a period of time.
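A minimal sketch of what that could look like, with a separate metric and a threshold-over-time alarm; all names, thresholds, and periods below are assumptions for illustration only:

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Count only the duplicate-key errors in their own metric.
logs.put_metric_filter(
    logGroupName="/aws/containerinsights/eks-cluster/application",  # assumed log group path
    filterName="logs-celery-unique-violation",                      # hypothetical filter name
    filterPattern='"psycopg2.errors.UniqueViolation"',
    metricTransformations=[{
        "metricName": "celery-unique-violation",  # assumed metric name
        "metricNamespace": "LogMetrics",          # assumed namespace
        "metricValue": "1",
    }],
)

# Alarm only when the count is unusually high over a sustained period.
cloudwatch.put_metric_alarm(
    AlarmName="celery-unique-violation-warning",  # hypothetical alarm name
    MetricName="celery-unique-violation",
    Namespace="LogMetrics",
    Statistic="Sum",
    Period=300,               # 5 minute buckets (assumed)
    EvaluationPeriods=3,      # sustained for 15 minutes (assumed)
    Threshold=25,             # placeholder threshold
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmDescription="Warn when duplicate-key errors spike beyond the expected baseline",
)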
