Refactor alerts to distinguish between infrastructure failure, capacity, or service limits #1458
Comments
I'd like to know more about what you have in mind, but I'm not sure the filters described here are problematic or too generic: they don't target infrastructure, capacity, or service limits, but tell us when a worker fails. For example, when an email could not be sent due to a job failure.
@jimleroyer the alert generated by this alarm was deemed to be caused by a service exceeding its limit. I was curious to investigate the relationship between that happening and this alarm triggering.
Oh true, I reviewed some of last week's alerts and realized that the service rate limit did indeed trigger a Celery worker alert. Let's refine it so that at least the rate limit goes into its own alert (probably warning level) and does not trigger a Celery worker alert.
Another one that could be filtered out is the Here’s what it would look like to stop
You could use this same pattern to tune out any Celery errors that are expected in low quantities.
@patheard We might want to keep an alarm on a high volume of unique violation / duplicate key errors. So we could have one alarm that filters these out completely, as you suggested, and a dedicated one for them with a proper threshold over a period of time.
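To make the "threshold over a period of time" idea concrete, here is a minimal sketch of the logic a dedicated alarm would apply. It is plain Python rather than actual alarm configuration, and the class name `WindowedThreshold` and the numbers used are hypothetical, chosen only for illustration.

```python
from collections import deque


class WindowedThreshold:
    """Hypothetical helper: fires when `threshold` events fall within `period_seconds`."""

    def __init__(self, threshold: int, period_seconds: float):
        self.threshold = threshold
        self.period_seconds = period_seconds
        self._events = deque()  # timestamps of recent matching errors, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one matching error and return True if the alarm should fire."""
        self._events.append(timestamp)
        cutoff = timestamp - self.period_seconds
        # Drop events that are no longer inside the sliding window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold


# Illustrative values only: 3 duplicate key errors within 60 seconds.
alarm = WindowedThreshold(threshold=3, period_seconds=60)
fired = [alarm.record(t) for t in (0, 10, 20, 100)]
```

With this shape, a handful of isolated duplicate key errors never fires, while a burst inside the window does. The real threshold and period would need to come from what volume we consider normal.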
Candidates for refactoring:
"?\"ERROR/Worker\" ?\"ERROR/ForkPoolWorker\" ?\"WorkerLostError\""
found in the CloudWatch eks-cluster/application
logs. This is too generic. Identify a way to distinguish between intentionally thrown errors (message limits) and legitimate Celery failures.
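As a rough sketch of the distinction asked for here, the snippet below mimics the current filter (each `?"term"` is an OR match, so a line matches if it contains any of the three terms) and then routes lines carrying an expected-error marker to a separate bucket. The marker string `ServiceLimitExceeded`, the bucket names, and the sample log lines are all hypothetical; the real text of the limit errors would have to be taken from the actual logs.

```python
# The three terms from the existing filter pattern (matched case-sensitively).
GENERIC_TERMS = ("ERROR/Worker", "ERROR/ForkPoolWorker", "WorkerLostError")

# Hypothetical markers for errors that are thrown on purpose (e.g. message limits).
# Replace with the strings that actually appear in the application logs.
EXPECTED_MARKERS = ("ServiceLimitExceeded",)


def classify(line: str) -> str:
    """Return which alert bucket a log line would fall into."""
    if not any(term in line for term in GENERIC_TERMS):
        return "no_match"  # the current filter would not match this line at all
    if any(marker in line for marker in EXPECTED_MARKERS):
        return "service_limit"  # candidate for a separate, warning-level alert
    return "celery_failure"  # remains a Celery worker alert


# Made-up sample lines, for illustration only.
samples = [
    "ERROR/ForkPoolWorker-3 Task failed: ServiceLimitExceeded for service abc",
    "ERROR/ForkPoolWorker-1 Task failed: connection reset",
    "WorkerLostError: Worker exited prematurely",
    "INFO/MainProcess Task succeeded",
]
buckets = [classify(s) for s in samples]
```

The same split could be expressed as two filters: one that matches only the expected markers, and the existing one with those markers excluded.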