Machine ID: Add Prometheus metrics for loop tasks #52410

Merged
timothyb89 merged 2 commits into master from timothyb89/tbot-loop-prometheus-metrics on Feb 26, 2025

Conversation

timothyb89 (Contributor) commented Feb 22, 2025

This adds a number of Prometheus metrics to help track success, failure, and timing for loop iterations. The loop helper is used across tbot services, so these metrics universally cover identity and output renewals, among other tasks. It also includes the Teleport build collector.

New metrics include:

  • tbot_task_iteration_duration_seconds: histogram of iteration time, including all retries
  • tbot_task_iterations_successful: histogram of successful iterations, bucketed by the number of retries needed before success
  • tbot_task_iterations_failed: count of failures by task
  • tbot_task_iterations: simple counter of iterations attempted per task, regardless of outcome

This additionally renames service_heatbeat.go, which was misspelled.

changelog: Machine ID: Added new Prometheus metrics to track success and failure of renewal loops

@timothyb89
Contributor Author

Sample of new metrics:

# HELP tbot_task_iteration_duration_seconds Time between beginning and ultimate end of one task iteration regardless of outcome, including all retries
# TYPE tbot_task_iteration_duration_seconds histogram
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.1"} 0
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.17500000000000002"} 0
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.30625"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.5359375000000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.9378906250000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="1.6413085937500003"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="+Inf"} 1
tbot_task_iteration_duration_seconds_sum{name="bot-identity-renewal"} 0.230272
tbot_task_iteration_duration_seconds_count{name="bot-identity-renewal"} 1
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.1"} 0
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.17500000000000002"} 0
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.30625"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.5359375000000001"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.9378906250000001"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="1.6413085937500003"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="+Inf"} 2
tbot_task_iteration_duration_seconds_sum{name="output-renewal"} 0.407756875
tbot_task_iteration_duration_seconds_count{name="output-renewal"} 2
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.1"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.17500000000000002"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.30625"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.5359375000000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.9378906250000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="1.6413085937500003"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="+Inf"} 1
tbot_task_iteration_duration_seconds_sum{name="submit-heartbeat"} 0.031338792
tbot_task_iteration_duration_seconds_count{name="submit-heartbeat"} 1
# HELP tbot_task_iterations Number of task iteration attempts, not counting retries
# TYPE tbot_task_iterations counter
tbot_task_iterations{name="bot-identity-renewal"} 1
tbot_task_iterations{name="output-renewal"} 2
tbot_task_iterations{name="submit-heartbeat"} 1
# HELP tbot_task_iterations_successful Histogram of task iterations that ultimately succeeded, bucketed by number of retries before success
# TYPE tbot_task_iterations_successful histogram
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="0"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="1"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="2"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="3"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="4"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="5"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="+Inf"} 1
tbot_task_iterations_successful_sum{name="bot-identity-renewal"} 0
tbot_task_iterations_successful_count{name="bot-identity-renewal"} 1
tbot_task_iterations_successful_bucket{name="output-renewal",le="0"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="1"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="2"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="3"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="4"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="5"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="+Inf"} 2
tbot_task_iterations_successful_sum{name="output-renewal"} 0
tbot_task_iterations_successful_count{name="output-renewal"} 2
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="0"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="1"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="2"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="3"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="4"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="5"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="+Inf"} 1
tbot_task_iterations_successful_sum{name="submit-heartbeat"} 0
tbot_task_iterations_successful_count{name="submit-heartbeat"} 1
# HELP teleport_build_info Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1.
# TYPE teleport_build_info gauge
teleport_build_info{gitref="api/v17.0.0-dev.gusr.1-2795-g75cc82e38e",goversion="go1.24.0",version="18.0.0-dev"} 1
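
The duration bucket bounds in the sample are consistent with an exponential series starting at 0.1 with a factor of 1.75. Whether the PR builds them with `prometheus.ExponentialBuckets(0.1, 1.75, 6)` is an assumption. The sketch below uses only the standard library to reproduce the bounds and to show why the `_bucket` values are cumulative; the helper names are hypothetical.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it. This is what the _bucket series report.
func cumulativeCounts(bounds, observations []float64) []int {
	counts := make([]int, len(bounds))
	for _, v := range observations {
		for i, le := range bounds {
			if v <= le {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	bounds := exponentialBuckets(0.1, 1.75, 6)
	// The single bot-identity-renewal observation from the sample above.
	fmt.Println(cumulativeCounts(bounds, []float64{0.230272}))
}
```

An observation of 0.230272 seconds lands above the 0.175 bound and below 0.30625, so every bucket from 0.30625 upward counts it, as in the bot-identity-renewal rows of the sample.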

Comment thread lib/tbot/loop.go
Contributor Author

One issue here is that all of our output services use output-renewal as their task name, so they'll be grouped together. Do we want to make that more specific? I'd suggest either appending something more specific to the name (e.g. output-renewal/application) or adding a subtype field + prometheus label.

Contributor

Yeah, I think we can probably give all of these more specific names. Perhaps we just leverage the service name first?

Contributor Author

Good call, I've plumbed through "service" as an additional label. One mild caveat is that the stringer sometimes includes file paths in the label value, so I'm tempted to replace .String() with the config.FooServiceType constants. Keeping .String() isn't likely to be a huge cardinality issue and keeps individual outputs separate, but it feels gross.
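
To make the effect of the extra label concrete, here is a small sketch of how a labelled counter produces one series per distinct label set. The `labels` and `counterVec` types and the service values are hypothetical and only mimic the behavior of a Prometheus counter vector.

```go
package main

import "fmt"

// labels is a hypothetical label set. "service" is the label discussed in
// this thread.
type labels struct {
	name    string
	service string
}

// counterVec mimics a labelled counter: one series per distinct label set.
type counterVec map[labels]int

func (c counterVec) inc(name, service string) {
	c[labels{name: name, service: service}]++
}

func main() {
	iterations := counterVec{}
	// The service values here are made up for illustration.
	iterations.inc("output-renewal", "application-output")
	iterations.inc("output-renewal", "identity-output")
	iterations.inc("output-renewal", "identity-output")

	// All three share the task name, but the service label splits them
	// into two separate series.
	fmt.Println(len(iterations))
}
```

The same mechanism explains the cardinality concern: every distinct label value, such as a stringer output that embeds a file path, creates its own series.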

Comment thread lib/tbot/loop.go
Buckets: []float64{0, 1, 2, 3, 4, 5},
}, []string{"name"},
)
loopIterationsFailureCounter = prometheus.NewCounterVec(
Contributor

I tend to see this grouped with tbot_task_iterations using some kind of label to indicate status.

Contributor Author

I think since we're tracking successful iterations via a histogram, the failed state - which would've otherwise been grouped into a labelled counter alongside successful - gets left out on its own. If you'd like, I could group them anyway and record tbot_task_iterations{status="successful"} as a duplicate of tbot_task_iterations_successful_count? (That one's recorded automatically as part of the histogram)

I've at least renamed tbot_task_iterations to tbot_task_iterations_total since I think that's a bit more in line with Prometheus conventions. That one definitely needs to remain separate, I think.

Contributor

I figured that the histogram could also have a "state" label - but this is fine how it is if you'd rather proceed as is.

@timothyb89 timothyb89 added this pull request to the merge queue Feb 26, 2025
Merged via the queue into master with commit 8b9c3fa Feb 26, 2025
@timothyb89 timothyb89 deleted the timothyb89/tbot-loop-prometheus-metrics branch February 26, 2025 01:34
@public-teleport-github-review-bot

@timothyb89 See the table below for backport results.

Branch | Result
branch/v16 | Failed
branch/v17 | Create PR

timothyb89 added a commit that referenced this pull request Mar 4, 2025
* Machine ID: Add Prometheus metrics for loop tasks

This adds a number of Prometheus metrics to help track success,
failure, and timing for loop iterations. The loop helper is used
across tbot services, so these metrics universally cover identity
and output renewals, among other tasks.

Also, renames `service_heatbeat.go`, which was misspelled.

* Include service name as a label; rename metrics for conventions
github-merge-queue bot pushed a commit that referenced this pull request Mar 11, 2025