51 changes: 46 additions & 5 deletions lib/tbot/loop.go
Contributor Author

One issue here is that all of our output services use output-renewal as their task name, so they'll be grouped together under a single label value. Do we want to make that more specific? I'd suggest either appending something more specific to the name (e.g. output-renewal/application) or adding a subtype field plus a Prometheus label.

Contributor

Yeah, I think we can probably give all of these more specific names. Perhaps we just leverage the service name first?

Contributor Author

Good call, I've plumbed through "service" as an additional label. One mild caveat: the stringer sometimes includes file paths in the label value, so I'm tempted to replace .String() with config.FooServiceType constants. The cardinality isn't likely to be a huge issue, and using .String() keeps individual outputs separate, but it feels gross.
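The cardinality trade-off described in this comment can be sketched with plain Go. The snippet below is a stdlib-only illustration, not tbot's code: `outputService`, its fields, and the `"application-output"` / `"database-output"` strings are hypothetical stand-ins. It assumes a stringer that embeds a destination path and compares how many distinct label values it produces against a per-type constant.

```go
package main

import "fmt"

// outputService is a hypothetical stand-in for a tbot output service.
type outputService struct {
	serviceType string // hypothetical constant-like service type
	destination string // hypothetical destination path for the output
}

// String mimics a stringer that embeds the destination in its result.
func (s outputService) String() string {
	return fmt.Sprintf("%s (%s)", s.serviceType, s.destination)
}

// distinctLabelValues counts the distinct label values produced by labelOf.
func distinctLabelValues(services []outputService, labelOf func(outputService) string) int {
	seen := make(map[string]struct{})
	for _, s := range services {
		seen[labelOf(s)] = struct{}{}
	}
	return len(seen)
}

// exampleServices returns three application outputs and one database output.
func exampleServices() []outputService {
	return []outputService{
		{serviceType: "application-output", destination: "/opt/a"},
		{serviceType: "application-output", destination: "/opt/b"},
		{serviceType: "application-output", destination: "/opt/c"},
		{serviceType: "database-output", destination: "/opt/db"},
	}
}

func main() {
	services := exampleServices()
	byString := distinctLabelValues(services, outputService.String)
	byType := distinctLabelValues(services, func(s outputService) string {
		return s.serviceType
	})
	fmt.Println("distinct label values via String():", byString)
	fmt.Println("distinct label values via type:", byType)
}
```

With the stringer, every configured output becomes its own time series; with a type constant, outputs of the same kind collapse into one series. Which is preferable depends on whether per-output visibility is worth the extra series.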

@@ -26,14 +26,45 @@ import (

"github.com/gravitational/trace"
"github.com/jonboulle/clockwork"
"github.com/prometheus/client_golang/prometheus"

"github.com/gravitational/teleport/api/utils/retryutils"
)

var (
loopIterationsCounter = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "tbot_task_iterations_total",
Help: "Number of task iteration attempts, not counting retries",
}, []string{"service", "name"},
)
loopIterationsSuccessCounter = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "tbot_task_iterations_successful",
Help: "Histogram of task iterations that ultimately succeeded, bucketed by number of retries before success",
Buckets: []float64{0, 1, 2, 3, 4, 5},
}, []string{"service", "name"},
)
loopIterationsFailureCounter = prometheus.NewCounterVec(
Contributor

I tend to see this grouped with tbot_task_iterations using some kind of label to indicate status.

Contributor Author

Since we're tracking successful iterations via a histogram, the failed state, which would otherwise have been grouped into a labelled counter alongside the successful one, gets left out on its own. If you'd like, I could group them anyway and record tbot_task_iterations{status="successful"} as a duplicate of tbot_task_iterations_successful_count? (That series is recorded automatically as part of the histogram.)

I've at least renamed tbot_task_iterations to tbot_task_iterations_total, since that's a bit more in line with Prometheus conventions. That one definitely needs to remain separate, I think.

Contributor

I figured the histogram could also have a "state" label, but this is fine as it is if you'd rather proceed.

prometheus.CounterOpts{
Name: "tbot_task_iterations_failed",
Help: "Number of task iterations that ultimately failed, not counting retries",
}, []string{"service", "name"},
)
loopIterationTime = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "tbot_task_iteration_duration_seconds",
Help: "Time between beginning and ultimate end of one task iteration regardless of outcome, including all retries",
Buckets: prometheus.ExponentialBuckets(0.1, 1.75, 6),
}, []string{"service", "name"},
)
)
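To make the bucket choices above easier to reason about, here is a stdlib-only model of a cumulative, Prometheus-style histogram. It is a simplified sketch rather than the client library's implementation: `histogram`, `newHistogram`, `observe`, and `exponentialBounds` are hypothetical names. The bounds helper assumes the usual "start, then multiply by factor" rule for exponential buckets.

```go
package main

import "fmt"

// exponentialBounds returns count upper bounds: start, start*factor, ...
func exponentialBounds(start, factor float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// histogram is a simplified model of a cumulative histogram.
type histogram struct {
	bounds  []float64 // finite upper bounds, ascending
	buckets []uint64  // buckets[i] counts values <= bounds[i]; last entry is +Inf
	count   uint64    // total observations (the "_count" series)
	sum     float64   // sum of observed values (the "_sum" series)
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, buckets: make([]uint64, len(bounds)+1)}
}

// observe records v in every bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	h.count++
	h.sum += v
	for i, le := range h.bounds {
		if v <= le {
			h.buckets[i]++
		}
	}
	h.buckets[len(h.bounds)]++ // the +Inf bucket counts everything
}

func main() {
	// Retries-before-success histogram with bounds 0..5.
	retries := newHistogram([]float64{0, 1, 2, 3, 4, 5})
	for _, r := range []float64{0, 0, 2} {
		retries.observe(r)
	}
	fmt.Println("retry buckets:", retries.buckets, "count:", retries.count)

	// Duration bounds for start=0.1, factor=1.75, count=6.
	for _, b := range exponentialBounds(0.1, 1.75, 6) {
		fmt.Printf("le=%.3f\n", b)
	}
}
```

Two things fall out of the model. The total count is maintained alongside the buckets, which is why a separate "successful" counter would duplicate the histogram's count series. And with a start of 0.1 and a factor of 1.75, the sixth bound is roughly 1.64 seconds, so any longer iteration lands only in the implicit +Inf bucket.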

type runOnIntervalConfig struct {
- name string
- f func(ctx context.Context) error
- clock clockwork.Clock
+ service string
+ name string
+ f func(ctx context.Context) error
+ clock clockwork.Clock
// reloadCh allows the task to be triggered immediately, ideal for handling
// CA rotations or a manual signal from a user.
// reloadCh can be nil, in which case, the task will only run on the
@@ -49,8 +80,6 @@ type runOnIntervalConfig struct {
// runOnInterval runs a function on a given interval, with retries and jitter.
//
// TODO(noah): Emit Prometheus metrics for:
- // - Success/Failure of attempts
- // - Time taken to execute attempt
// - Time of next attempt
func runOnInterval(ctx context.Context, cfg runOnIntervalConfig) error {
switch {
@@ -87,6 +116,9 @@ func runOnInterval(ctx context.Context, cfg runOnIntervalConfig) error {
}
firstRun = false

loopIterationsCounter.WithLabelValues(cfg.service, cfg.name).Inc()
startTime := time.Now()

var err error
for attempt := 1; attempt <= cfg.retryLimit; attempt++ {
log.InfoContext(
@@ -97,6 +129,7 @@ func runOnInterval(ctx context.Context, cfg runOnIntervalConfig) error {
)
err = cfg.f(ctx)
if err == nil {
loopIterationsSuccessCounter.WithLabelValues(cfg.service, cfg.name).Observe(float64(attempt - 1))
break
}

@@ -114,12 +147,20 @@ func runOnInterval(ctx context.Context, cfg runOnIntervalConfig) error {
)
select {
case <-ctx.Done():
// Note: will discard metric update for this loop. It
// probably won't be collected if we're shutting down,
// anyway.
return nil
case <-cfg.clock.After(backoffTime):
}
}
}

loopIterationTime.WithLabelValues(cfg.service, cfg.name).Observe(time.Since(startTime).Seconds())

if err != nil {
loopIterationsFailureCounter.WithLabelValues(cfg.service, cfg.name).Inc()

if cfg.exitOnRetryExhausted {
log.ErrorContext(
ctx,
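
The order in which the hunks above record metrics can be summarized in a small stdlib-only sketch. This is an illustration under assumptions, not the PR's code: `iterationStats`, `runIteration`, and `failNTimes` are hypothetical, backoff and context cancellation are omitted, and plain fields stand in for the Prometheus collectors.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// iterationStats is a hypothetical in-memory stand-in for the four metrics.
type iterationStats struct {
	iterations       int             // one per iteration, not per attempt
	failed           int             // iterations that exhausted all attempts
	retriesOnSuccess []int           // retries needed before a success
	durations        []time.Duration // whole iteration, including retries
}

// runIteration runs f up to retryLimit times and records the outcome once.
func runIteration(stats *iterationStats, retryLimit int, f func() error) error {
	stats.iterations++
	start := time.Now()

	var err error
	for attempt := 1; attempt <= retryLimit; attempt++ {
		if err = f(); err == nil {
			// Attempt 1 succeeding means zero retries.
			stats.retriesOnSuccess = append(stats.retriesOnSuccess, attempt-1)
			break
		}
		// Backoff between attempts is omitted in this sketch.
	}

	// Duration is recorded regardless of outcome.
	stats.durations = append(stats.durations, time.Since(start))
	if err != nil {
		stats.failed++
	}
	return err
}

// failNTimes returns a task that fails its first n calls, then succeeds.
func failNTimes(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("transient failure")
		}
		return nil
	}
}

func main() {
	stats := &iterationStats{}
	_ = runIteration(stats, 3, failNTimes(2)) // succeeds on the third attempt
	_ = runIteration(stats, 3, failNTimes(5)) // exhausts all three attempts
	fmt.Println("iterations:", stats.iterations, "failed:", stats.failed,
		"retries before success:", stats.retriesOnSuccess)
}
```

Counting the iteration up front and the outcome at the end means the total can briefly run ahead of successes plus failures while an iteration is in flight, and permanently so if the loop exits early on shutdown, as the comment in the diff notes.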
1 change: 1 addition & 0 deletions lib/tbot/service_application_output.go
@@ -61,6 +61,7 @@ func (s *ApplicationOutputService) Run(ctx context.Context) error {
defer unsubscribe()

err := runOnInterval(ctx, runOnIntervalConfig{
service: s.String(),
name: "output-renewal",
f: s.generate,
interval: cmp.Or(s.cfg.CredentialLifetime, s.botCfg.CredentialLifetime).RenewalInterval,
3 changes: 2 additions & 1 deletion lib/tbot/service_bot_identity.go
@@ -267,7 +267,8 @@ func (s *identityService) Run(ctx context.Context) error {
)

err := runOnInterval(ctx, runOnIntervalConfig{
- name: "bot-identity-renewal",
+ service: s.String(),
+ name: "bot-identity-renewal",
f: func(ctx context.Context) error {
return s.renew(ctx, storageDestination)
},
1 change: 1 addition & 0 deletions lib/tbot/service_client_credential.go
@@ -55,6 +55,7 @@ func (s *ClientCredentialOutputService) Run(ctx context.Context) error {
defer unsubscribe()

err := runOnInterval(ctx, runOnIntervalConfig{
service: s.String(),
name: "output-renewal",
f: s.generate,
interval: s.botCfg.CredentialLifetime.RenewalInterval,
1 change: 1 addition & 0 deletions lib/tbot/service_database_output.go
@@ -61,6 +61,7 @@ func (s *DatabaseOutputService) Run(ctx context.Context) error {
defer unsubscribe()

err := runOnInterval(ctx, runOnIntervalConfig{
service: s.String(),
name: "output-renewal",
f: s.generate,
interval: cmp.Or(s.cfg.CredentialLifetime, s.botCfg.CredentialLifetime).RenewalInterval,
@@ -95,6 +95,7 @@ func (s *heartbeatService) OneShot(ctx context.Context) error {
func (s *heartbeatService) Run(ctx context.Context) error {
isStartup := true
err := runOnInterval(ctx, runOnIntervalConfig{
service: s.String(),
name: "submit-heartbeat",
log: s.log,
interval: s.interval,
1 change: 1 addition & 0 deletions lib/tbot/service_identity_output.go
@@ -73,6 +73,7 @@ func (s *IdentityOutputService) Run(ctx context.Context) error {
defer unsubscribe()

err := runOnInterval(ctx, runOnIntervalConfig{
service: s.String(),
name: "output-renewal",
f: s.generate,
interval: cmp.Or(s.cfg.CredentialLifetime, s.botCfg.CredentialLifetime).RenewalInterval,
1 change: 1 addition & 0 deletions lib/tbot/service_kubernetes_output.go
@@ -78,6 +78,7 @@ func (s *KubernetesOutputService) Run(ctx context.Context) error {
defer unsubscribe()

err := runOnInterval(ctx, runOnIntervalConfig{
service: s.String(),
name: "output-renewal",
f: s.generate,
interval: cmp.Or(s.cfg.CredentialLifetime, s.botCfg.CredentialLifetime).RenewalInterval,
1 change: 1 addition & 0 deletions lib/tbot/service_kubernetes_v2_output.go
@@ -77,6 +77,7 @@ func (s *KubernetesV2OutputService) Run(ctx context.Context) error {
defer unsubscribe()

return trace.Wrap(runOnInterval(ctx, runOnIntervalConfig{
service: s.String(),
name: "output-renewal",
f: s.generate,
interval: cmp.Or(s.cfg.CredentialLifetime, s.botCfg.CredentialLifetime).RenewalInterval,
1 change: 1 addition & 0 deletions lib/tbot/service_ssh_host_output.go
@@ -64,6 +64,7 @@ func (s *SSHHostOutputService) Run(ctx context.Context) error {
defer unsubscribe()

err := runOnInterval(ctx, runOnIntervalConfig{
service: s.String(),
name: "output-renewal",
f: s.generate,
interval: cmp.Or(s.cfg.CredentialLifetime, s.botCfg.CredentialLifetime).RenewalInterval,
3 changes: 2 additions & 1 deletion lib/tbot/service_ssh_multiplexer.go
@@ -391,7 +391,8 @@ func (s *SSHMultiplexerService) identityRenewalLoop(
reloadCh, unsubscribe := s.reloadBroadcaster.subscribe()
defer unsubscribe()
err := runOnInterval(ctx, runOnIntervalConfig{
- name: "identity-renewal",
+ service: s.String(),
+ name: "identity-renewal",
f: func(ctx context.Context) error {
id, err := s.generateIdentity(ctx)
if err != nil {
9 changes: 8 additions & 1 deletion lib/tbot/tbot.go
@@ -132,7 +132,14 @@ func (b *Bot) Run(ctx context.Context) (err error) {
defer func() { apitracing.EndSpan(span, err) }()
startedAt := time.Now()

- if err := metrics.RegisterPrometheusCollectors(clientMetrics); err != nil {
+ if err := metrics.RegisterPrometheusCollectors(
+ metrics.BuildCollector(),
+ clientMetrics,
+ loopIterationsCounter,
+ loopIterationsSuccessCounter,
+ loopIterationsFailureCounter,
+ loopIterationTime,
+ ); err != nil {
return trace.Wrap(err)
}
