Machine ID: Add Prometheus metrics for loop tasks #52410
This adds a number of Prometheus metrics to help track success, failure, and timing for loop iterations. The loop helper is used across tbot services, so these metrics universally cover identity and output renewals, among other tasks. Also, renames `service_heatbeat.go`, which was misspelled.
One issue here is that all of our output services use output-renewal as their task name, so they'll be grouped together. Do we want to make that more specific? I'd suggest either appending something more specific to the name (e.g. output-renewal/application) or adding a subtype field + prometheus label.
Yeah, I think we can give these all more specific names. Perhaps we just leverage the service name first?
Good call, I've plumbed through "service" as an additional label. The stringer has a mild caveat of sometimes including filepaths in the label value, though, so I'm tempted to replace `.String()` with `config.FooServiceType` constants. The cardinality isn't likely to be a huge issue, and using the stringer keeps individual outputs separate, but it feels gross.
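To make the labelling trade-off concrete, here is a minimal sketch of how a separate "service" label keeps services that share the `output-renewal` task name in distinct series. It uses only the standard library; `counterVec`, `labels`, and the service values are hypothetical stand-ins, not the Prometheus client API or tbot's actual service names.

```go
package main

import "fmt"

// labels identifies one time series, mirroring the "name" and "service"
// labels discussed in this thread.
type labels struct {
	name    string
	service string
}

// counterVec is a hypothetical, minimal stand-in for a labelled counter.
type counterVec map[labels]int

// inc increments the series identified by the given label values.
func (c counterVec) inc(name, service string) {
	c[labels{name: name, service: service}]++
}

func main() {
	failures := counterVec{}

	// Both services share the task name, but the service label keeps
	// their series apart. The service values here are illustrative only.
	failures.inc("output-renewal", "application")
	failures.inc("output-renewal", "application")
	failures.inc("output-renewal", "identity")

	for l, n := range failures {
		fmt.Printf("failed{name=%q,service=%q} %d\n", l.name, l.service, n)
	}
}
```

Each distinct label value creates a new series, which is why a label value that embeds a file path adds more cardinality than a fixed per-type constant would.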
		Buckets: []float64{0, 1, 2, 3, 4, 5},
	}, []string{"name"},
)
loopIterationsFailureCounter = prometheus.NewCounterVec(
I tend to see this grouped with tbot_task_iterations using some kind of label to indicate status.
I think since we're tracking successful iterations via a histogram, the failed state - which would've otherwise been grouped into a labelled counter alongside successful - gets left out on its own. If you'd like, I could group them anyway and record tbot_task_iterations{status="successful"} as a duplicate of tbot_task_iterations_successful_count? (That one's recorded automatically as part of the histogram)
I've at least renamed tbot_task_iterations to tbot_task_iterations_total since I think that's a bit more in line with Prometheus conventions. That one definitely needs to remain separate, I think.
I figured that the histogram could also have a "state" label, but this is fine as is if you'd rather proceed.
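For readers following the point about the histogram's automatically recorded count, here is a simplified standard-library model of what a histogram exposes: cumulative buckets, a sum, and a count. `attemptHistogram` is hypothetical and is not the Prometheus client implementation. The bucket bounds are borrowed from the snippet under review, and the implicit `+Inf` bucket is omitted for brevity.

```go
package main

import "fmt"

// attemptHistogram is a simplified, hypothetical model of the series a
// Prometheus histogram exposes: cumulative buckets plus a sum and a count.
type attemptHistogram struct {
	bounds  []float64 // bucket upper bounds (the "le" label)
	buckets []uint64  // cumulative counts, one per bound
	sum     float64
	count   uint64
}

func newAttemptHistogram(bounds []float64) *attemptHistogram {
	return &attemptHistogram{bounds: bounds, buckets: make([]uint64, len(bounds))}
}

// observe records one successful iteration that needed v attempts.
func (h *attemptHistogram) observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.buckets[i]++
		}
	}
	h.sum += v
	h.count++
}

func main() {
	h := newAttemptHistogram([]float64{0, 1, 2, 3, 4, 5})

	// Two iterations succeeded on the first attempt, one on the third.
	h.observe(1)
	h.observe(1)
	h.observe(3)

	fmt.Println("buckets:", h.buckets)
	fmt.Println("sum:", h.sum, "count:", h.count)
}
```

Because every observation increments the count, the `_count` series already equals the number of successful iterations, so a separate success counter would duplicate it.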
@timothyb89 See the table below for backport results.
* Machine ID: Add Prometheus metrics for loop tasks

  This adds a number of Prometheus metrics to help track success, failure, and timing for loop iterations. The loop helper is used across tbot services, so these metrics universally cover identity and output renewals, among other tasks. Also, renames `service_heatbeat.go`, which was misspelled.

* Include service name as a label; rename metrics for conventions
This adds a number of Prometheus metrics to help track success, failure, and timing for loop iterations. The loop helper is used across tbot services, so these metrics universally cover identity and output renewals, among other tasks. Also includes the Teleport build collector.
New metrics include:
- tbot_task_iteration_duration_seconds: histogram of iteration time, including all retries
- tbot_task_iterations_successful: histogram of # of attempts needed for a particular iteration to succeed
- tbot_task_iterations_failed: count of failures by task
- tbot_task_iterations: simple counter of iterations attempted per task, regardless of outcome

This additionally renames `service_heatbeat.go`, which was misspelled.

changelog: Machine ID: Added new Prometheus metrics to track success and failure of renewal loops
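As a rough illustration of how the four metrics relate to a single loop iteration, here is a standard-library sketch. `taskStats` and `runIteration` are hypothetical names, not tbot's loop helper, and the in-memory fields stand in for the Prometheus collectors. It assumes the duration is recorded for failed iterations as well, and it omits any backoff between attempts.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a placeholder error used by the example tasks.
var errTransient = errors.New("transient failure")

// taskStats is a hypothetical in-memory stand-in for the four metrics.
type taskStats struct {
	iterations      int             // tbot_task_iterations
	failed          int             // tbot_task_iterations_failed
	successAttempts []int           // observations for tbot_task_iterations_successful
	durations       []time.Duration // observations for tbot_task_iteration_duration_seconds
}

// runIteration runs fn up to maxAttempts times and records the outcome.
// The duration spans every attempt, so retries are included.
// Assumption: the duration is recorded whether or not the iteration succeeds.
func runIteration(stats *taskStats, maxAttempts int, fn func() error) error {
	stats.iterations++
	start := time.Now()
	defer func() { stats.durations = append(stats.durations, time.Since(start)) }()

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			stats.successAttempts = append(stats.successAttempts, attempt)
			return nil
		}
	}
	stats.failed++
	return err
}

func main() {
	stats := &taskStats{}

	// Fails twice, then succeeds on the third attempt.
	calls := 0
	_ = runIteration(stats, 5, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})

	// Never succeeds, so the iteration is counted as failed.
	_ = runIteration(stats, 2, func() error { return errTransient })

	fmt.Printf("iterations=%d failed=%d successAttempts=%v durations=%d\n",
		stats.iterations, stats.failed, stats.successAttempts, len(stats.durations))
}
```

In this model the iteration counter always equals the number of success observations plus the number of failures, which is why the total can be tracked separately from the per-outcome metrics.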