Machine ID: Add Prometheus metrics for loop tasks by timothyb89 · Pull Request #52410 · gravitational/teleport

timothyb89 · 2025-02-22T02:50:51Z

This adds a number of Prometheus metrics to help track success, failure, and timing for loop iterations. The loop helper is used across tbot services, so these metrics universally cover identity and output renewals, among other tasks. It also includes the Teleport build collector.

New metrics include:

tbot_task_iteration_duration_seconds: histogram of iteration time, including all retries
tbot_task_iterations_successful: histogram of the number of attempts needed for a particular iteration to succeed
tbot_task_iterations_failed: count of failures by task
tbot_task_iterations: simple counter of iterations attempted per task, regardless of outcome

This additionally renames service_heatbeat.go, which was misspelled.

changelog: Machine ID: Added new Prometheus metrics to track success and failure of renewal loops

timothyb89 · 2025-02-22T02:52:18Z

Sample of new metrics:

# HELP tbot_task_iteration_duration_seconds Time between beginning and ultimate end of one task iteration regardless of outcome, including all retries
# TYPE tbot_task_iteration_duration_seconds histogram
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.1"} 0
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.17500000000000002"} 0
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.30625"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.5359375000000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.9378906250000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="1.6413085937500003"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="+Inf"} 1
tbot_task_iteration_duration_seconds_sum{name="bot-identity-renewal"} 0.230272
tbot_task_iteration_duration_seconds_count{name="bot-identity-renewal"} 1
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.1"} 0
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.17500000000000002"} 0
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.30625"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.5359375000000001"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.9378906250000001"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="1.6413085937500003"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="+Inf"} 2
tbot_task_iteration_duration_seconds_sum{name="output-renewal"} 0.407756875
tbot_task_iteration_duration_seconds_count{name="output-renewal"} 2
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.1"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.17500000000000002"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.30625"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.5359375000000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.9378906250000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="1.6413085937500003"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="+Inf"} 1
tbot_task_iteration_duration_seconds_sum{name="submit-heartbeat"} 0.031338792
tbot_task_iteration_duration_seconds_count{name="submit-heartbeat"} 1
# HELP tbot_task_iterations Number of task iteration attempts, not counting retries
# TYPE tbot_task_iterations counter
tbot_task_iterations{name="bot-identity-renewal"} 1
tbot_task_iterations{name="output-renewal"} 2
tbot_task_iterations{name="submit-heartbeat"} 1
# HELP tbot_task_iterations_successful Histogram of task iterations that ultimately succeeded, bucketed by number of retries before success
# TYPE tbot_task_iterations_successful histogram
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="0"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="1"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="2"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="3"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="4"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="5"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="+Inf"} 1
tbot_task_iterations_successful_sum{name="bot-identity-renewal"} 0
tbot_task_iterations_successful_count{name="bot-identity-renewal"} 1
tbot_task_iterations_successful_bucket{name="output-renewal",le="0"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="1"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="2"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="3"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="4"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="5"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="+Inf"} 2
tbot_task_iterations_successful_sum{name="output-renewal"} 0
tbot_task_iterations_successful_count{name="output-renewal"} 2
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="0"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="1"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="2"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="3"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="4"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="5"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="+Inf"} 1
tbot_task_iterations_successful_sum{name="submit-heartbeat"} 0
tbot_task_iterations_successful_count{name="submit-heartbeat"} 1
# HELP teleport_build_info Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1.
# TYPE teleport_build_info gauge
teleport_build_info{gitref="api/v17.0.0-dev.gusr.1-2795-g75cc82e38e",goversion="go1.24.0",version="18.0.0-dev"} 1
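The duration bucket boundaries in this sample (0.1, 0.175, 0.30625, ...) are consistent with an exponential series starting at 0.1 with a factor of 1.75 and six buckets. That is an inference from the numbers, since the bucket definition itself does not appear in this thread. The sketch below reproduces the series with a hypothetical helper, `exponentialBuckets`, and adds a second hypothetical helper, `perBucket`, that converts the cumulative `le` counts into per-bucket counts.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	for i := 0; i < count; i++ {
		buckets = append(buckets, start)
		start *= factor
	}
	return buckets
}

// perBucket converts cumulative bucket counts, as exposed in the Prometheus
// text format, into the number of observations in each individual bucket.
func perBucket(cumulative []uint64) []uint64 {
	out := make([]uint64, len(cumulative))
	var prev uint64
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// Assumed parameters, inferred from the sample output above.
	fmt.Println(exponentialBuckets(0.1, 1.75, 6))

	// Cumulative counts for name="output-renewal", including the +Inf bucket.
	fmt.Println(perBucket([]uint64{0, 0, 2, 2, 2, 2, 2}))
}
```

Read this way, both output-renewal iterations in the sample took between 0.175 and 0.30625 seconds, which agrees with the reported sum of roughly 0.41 seconds over 2 observations.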

timothyb89 · 2025-02-22T02:57:03Z

One issue here is that all of our output services use output-renewal as their task name, so they'll be grouped together. Do we want to make that more specific? I'd suggest either appending something more specific to the name (e.g. output-renewal/application) or adding a subtype field + prometheus label.

strideynet · 2025-02-24

Yeah, I think we can probably give these all more specific names. Perhaps we just leverage the service name first?

timothyb89 · 2025-02-25

Good call, I've plumbed through "service" as an additional label. The stringer has a mild caveat of sometimes including file paths in the label value, though, so I'm tempted to replace .String() with the config.FooServiceType constants. The cardinality isn't likely to be a huge issue, and it keeps individual outputs separate, but it feels gross.
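The effect of the extra label can be sketched with a small labelled counter. This is an illustration only: `counterVec` is a hypothetical stand-in for a Prometheus counter vector, and the service values used in `main` are made up for the example rather than taken from tbot.

```go
package main

import "fmt"

// labels identifies one time series. The fields mirror the "service" and
// "name" labels discussed above.
type labels struct {
	Service string
	Name    string
}

// counterVec is a hypothetical, minimal stand-in for a labelled counter.
type counterVec map[labels]int

// inc increments the series for the given service and task name.
func (c counterVec) inc(service, name string) {
	c[labels{Service: service, Name: name}]++
}

// totalByName sums over all services, which is what a name-only label
// would have reported.
func (c counterVec) totalByName(name string) int {
	total := 0
	for k, v := range c {
		if k.Name == name {
			total += v
		}
	}
	return total
}

func main() {
	iterations := counterVec{}

	// Hypothetical service values; every output shares the same task name.
	iterations.inc("application-output", "output-renewal")
	iterations.inc("identity-output", "output-renewal")
	iterations.inc("application-output", "output-renewal")

	fmt.Println(len(iterations))                          // distinct series
	fmt.Println(iterations.totalByName("output-renewal")) // grouped view
}
```

Each distinct service value creates its own series, which is the cardinality trade-off mentioned above: values that embed file paths would produce one series per path.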

strideynet · 2025-02-24T10:57:21Z

+			Buckets: []float64{0, 1, 2, 3, 4, 5},
+		}, []string{"name"},
+	)
+	loopIterationsFailureCounter = prometheus.NewCounterVec(


I tend to see this grouped with tbot_task_iterations using some kind of label to indicate status.

timothyb89 · 2025-02-25

I think since we're tracking successful iterations via a histogram, the failed state - which would've otherwise been grouped into a labelled counter alongside successful - gets left out on its own. If you'd like, I could group them anyway and record tbot_task_iterations{status="successful"} as a duplicate of tbot_task_iterations_successful_count? (That one's recorded automatically as part of the histogram.)

I've at least renamed tbot_task_iterations to tbot_task_iterations_total since I think that's a bit more in line with Prometheus conventions. That one definitely needs to remain separate, I think.

strideynet · 2025-02-25

I figured that the histogram could also have a "state" label - but this is fine how it is if you'd rather proceed as is.
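The duplication described in this thread can be sketched as follows. The `recorder` type is hypothetical, and the "failed" status value is an assumption for illustration; only status="successful" is named in the discussion.

```go
package main

import "fmt"

// histogram keeps only the count and sum that every Prometheus histogram
// exposes as its _count and _sum series.
type histogram struct {
	count uint64
	sum   float64
}

func (h *histogram) observe(v float64) {
	h.count++
	h.sum += v
}

// statusKey identifies one series of a hypothetical status-labelled counter.
type statusKey struct {
	Name   string
	Status string
}

// recorder holds both designs side by side so they can be compared.
type recorder struct {
	successful map[string]*histogram // like tbot_task_iterations_successful
	byStatus   map[statusKey]uint64  // hypothetical status-labelled counter
}

func newRecorder() *recorder {
	return &recorder{
		successful: map[string]*histogram{},
		byStatus:   map[statusKey]uint64{},
	}
}

// succeed records a successful iteration in both places.
func (r *recorder) succeed(name string, retries int) {
	h, ok := r.successful[name]
	if !ok {
		h = &histogram{}
		r.successful[name] = h
	}
	h.observe(float64(retries))
	r.byStatus[statusKey{Name: name, Status: "successful"}]++
}

// fail records a failed iteration; there is no histogram for failures.
func (r *recorder) fail(name string) {
	r.byStatus[statusKey{Name: name, Status: "failed"}]++
}

func main() {
	r := newRecorder()
	r.succeed("output-renewal", 0)
	r.succeed("output-renewal", 2)
	r.fail("output-renewal")

	// prints "2 2 1": the histogram count and the successful counter agree.
	fmt.Println(
		r.successful["output-renewal"].count,
		r.byStatus[statusKey{Name: "output-renewal", Status: "successful"}],
		r.byStatus[statusKey{Name: "output-renewal", Status: "failed"}],
	)
}
```

Because every success increments both the histogram count and the status="successful" series, the labelled counter carries no new information for successes. It would only add a uniform way to query both outcomes.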

public-teleport-github-review-bot · 2025-02-26T01:36:37Z

@timothyb89 See the table below for backport results.

Branch	Result
branch/v16	Failed
branch/v17	Create PR

* Machine ID: Add Prometheus metrics for loop tasks

  This adds a number of Prometheus metrics to help track success, failure, and timing for loop iterations. The loop helper is used across tbot services, so these metrics universally cover identity and output renewals, among other tasks. Also, renames `service_heatbeat.go`, which was misspelled.

* Include service name as a label; rename metrics for conventions

timothyb89 added the machine-id label Feb 22, 2025

github-actions bot added the size/sm label Feb 22, 2025

github-actions bot requested review from boxofrad and strideynet February 22, 2025 02:51

strideynet reviewed Feb 24, 2025

boxofrad approved these changes Feb 24, 2025

Include service name as a label; rename metrics for conventions

65ef262

strideynet approved these changes Feb 25, 2025

timothyb89 added backport/branch/v16 backport/branch/v17 labels Feb 26, 2025

timothyb89 added this pull request to the merge queue Feb 26, 2025

Merged via the queue into master with commit 8b9c3fa Feb 26, 2025

timothyb89 deleted the timothyb89/tbot-loop-prometheus-metrics branch February 26, 2025 01:34

timothyb89 mentioned this pull request Feb 26, 2025

[v17] Machine ID: Add Prometheus metrics for loop tasks #52496

Merged

timothyb89 mentioned this pull request Mar 4, 2025

[v16] Machine ID: Add Prometheus metrics for loop tasks (#52410) #52729

Merged

timothyb89 merged 2 commits into master from timothyb89/tbot-loop-prometheus-metrics
