Add gardener_jobs_total error rate to error panel #934
There is a part of me that feels like this might not belong with the other two time series. It's taking the daily rate (not 5 min increase) and the legend shows the datatype (not the status).
Reviewable status:
complete! 1 of 1 approvals obtained
Thank you. I've updated the time ranges to match, to 10m for all three.
I think the datatype error is more helpful than the other weird status errors from gardener, but I wanted to surface the weird status errors since they are related to the root causes of some failures. Perhaps a second panel?
Thanks. My only observation is that the legend values are not really homogeneous.
I guess it will make sense to whoever looks at the queries anyway. You can add a second panel or leave them in the same one.
I agree these legend values are strange. In my mind, making them more visible encourages fixing the underlying source more than keeping them hidden; I only discovered them recently. Until they are more helpful, some interpretation is required by the viewer. I'll leave the queries in one panel for now, and acknowledge that there is room for improvement here. My main goal for these changes was just to surface more information first and refine second. Prior to #933 this dashboard had no visible gardener errors, which made it less helpful for serendipitously noticing issues or investigating known issues.
This change adds a new time series to the gardener error panel added in #933. The new time series matches the gardener jobs total error rate used in the GardenerFailureRateTooHighOrMissing alert.
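
For readers following the window discussion above, here is a minimal Python sketch of why the same counter gives different numbers over different windows, and why matching all series to one window (10m in the review) makes them comparable. It is only an illustration: the sample data, the function names `increase` and `error_ratio`, and the idea that the error rate is an errors-to-total ratio are assumptions, not taken from the dashboard or the alert definition, and the arithmetic is simpler than what a real monitoring backend does (no counter-reset handling, no extrapolation).

```python
from typing import List, Tuple

# One counter sample: (unix_seconds, cumulative_counter_value). Hypothetical shape.
Sample = Tuple[float, float]


def increase(samples: List[Sample], window_s: float, now: float) -> float:
    """Counter growth within [now - window_s, now].

    Simplified on purpose: assumes samples are time-ordered and the counter
    never resets inside the window.
    """
    in_window = [value for ts, value in samples if now - window_s <= ts <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


def error_ratio(errors: List[Sample], total: List[Sample],
                window_s: float, now: float) -> float:
    """Fraction of jobs that errored within the window (assumed definition)."""
    total_increase = increase(total, window_s, now)
    if total_increase == 0:
        return 0.0
    return increase(errors, window_s, now) / total_increase


# Made-up data: one sample per minute for 20 minutes,
# 10 jobs per minute overall, 1 error per minute.
total_jobs = [(60.0 * i, 10.0 * i) for i in range(21)]
error_jobs = [(60.0 * i, 1.0 * i) for i in range(21)]

five_min = increase(error_jobs, 300, now=1200)   # errors in the last 5 minutes
ten_min = increase(error_jobs, 600, now=1200)    # errors in the last 10 minutes
ratio_10m = error_ratio(error_jobs, total_jobs, 600, now=1200)
```

With a steady error stream, the 10-minute increase is twice the 5-minute one even though nothing changed in the system, so series drawn with different windows on one panel are hard to compare by eye. A ratio over a shared window avoids that scale mismatch.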