
RFD 197 - Prometheus metrics guidelines#51139

Open
hugoShaka wants to merge 7 commits into master from rfd/0197-prometheus-metrics

Conversation

@hugoShaka (Contributor)

Following the addition of the internal metric registry in Teleport, here's a mini RFD about how to add metrics in Teleport.

Rendered version.

@github-actions github-actions bot requested review from atburke and tigrato January 16, 2025 21:37
@github-actions github-actions bot added rfd Request for Discussion size/md labels Jan 16, 2025
Comment thread rfd/0197-prometheus-metrics.md Outdated
@rosstimothy (Contributor) left a comment

Are there any gotchas in converting existing metrics from the global registry to a local registry?

Comment thread rfd/0197-prometheus-metrics.md Outdated

#### Do

- Use `teleport.MetricsNamespace` as the namespace.
Contributor

Do we have any strategies for migrating legacy metrics which do not have a namespace? Should we register the same metric with and without the namespace?

@hugoShaka (Contributor, Author) Jan 16, 2025

I have not suggested a migration strategy because this would be a pretty disruptive change. I don't like double-registering metrics very much because it increases cardinality.

I guess we could pull a metric breaking change in a major version.

Contributor

Would it be terrible to double register the same metric in major version N and announce that the non-namespaced variants will be removed in version N+2? That would give people ~8 months of notice to adjust to use the correct metrics.

@hugoShaka (Contributor, Author)

I think that's acceptable, maybe we can gate the behaviour behind a TELEPORT_UNSTABLE_ env var or something, so if someone has an issue with this they can disable the duplication and use only the new or old metrics.

Contributor

I think it's worth checking which metrics would be affected and how many labels they have, it could be that the problem is small-ish.

Do we have metrics in the "wrong" namespaces that we would like to fix too?


@hugoShaka hugoShaka force-pushed the rfd/0197-prometheus-metrics branch from 52d7674 to 164e16e Compare January 16, 2025 22:34
@codingllama (Contributor) left a comment

Thanks, Hugo!

LGTM.

rosstimothy added a commit that referenced this pull request Jun 25, 2025
During the cache storage refactoring, all of the cache metrics
that were recorded via the backend.Reporter that wrapped the
cache backend were lost. This attempts to restore the metrics
as close to the original meaning as possible, though due to the
way the cache now works, some of the metrics don't make sense.
The biggest change is in the top requests metric, which has
historically been based on the backend.Key; since that key is
not present in the cache, the metric is now labeled by resource
kind. All existing dashboards and tctl top should still work
as expected.

The cache metrics were also refactored to honor best practices
laid out in [RFD 197](#51139).
@hugoShaka hugoShaka force-pushed the rfd/0197-prometheus-metrics branch from 92a08ed to 268976a Compare November 11, 2025 21:53
@hugoShaka hugoShaka added the no-changelog Indicates that a PR does not require a changelog entry label Nov 11, 2025
@hugoShaka (Contributor, Author)

@codingllama is back (🎉) and I'm dealing with metrics problems now, so I resurrected the PR based on changes from #61239 and addressed your outstanding comments.

@codingllama (Contributor) left a comment

LGTM.

I'm less "in the know" this time around, but it seems reasonable and a straight-up improvement over what we have.

Comment thread rfd/0197-prometheus-metrics.md Outdated
hosted plugins. We work around this by creating a dedicated registry and registering/unregistering it as a collector.
See https://github.com/prometheus/client_golang/pull/1766 for an example.

## Guidelines
Contributor

Should we add a guideline on what namespace / subsystem to use?

Should there always be a subsystem, or is an empty one OK?

Co-authored-by: Alan Parra <12500300+codingllama@users.noreply.github.com>

- Take the local in-process registry as an argument in your service constructor/main routine, like you would receive a
logger, and register your metrics against it.
- Pass the registry as a `*metrics.Registry`
Contributor

- What is the difference between a metrics.Registry and a prometheus.Registry or prometheus.Registerer?
- What is the reasoning for preferring the metrics.Registry?
- How do I get a handle to a metrics.Registry?
- How do I write tests if I'm now dependent on a pointer to a concrete metrics.Registry object? Is there a testing variant that is a noop?

@codingllama (Contributor) Nov 18, 2025

Disclaimer: I'm not Hugo.

Some of these you can glimpse from the code:

```go
// Registry is a [prometheus.Registerer] for a Teleport process that
// allows propagating additional information such as:
// - the metric namespace (`teleport`, `teleport_bot`, `teleport_plugins`)
// - an optional subsystem
//
// This should be passed anywhere that needs to register a metric.
type Registry struct {
	prometheus.Registerer
	namespace string
	subsystem string
}
```

1. A metrics.Registry wraps a prometheus.Registerer. Registerer is the interface, Registry is the concrete type. We must use the latter because of item 4.
2. metrics.Registry has some niceties over Registerer/prom.Registry.
3. metrics.NewRegistry() or metrics.NoopRegistry().
4. Use metrics.NoopRegistry().

Now, if you want the design to answer the question, that's a perfectly valid request IMO. (Linking to metrics.Registry might be enough?)

Contributor

Linking to code would be helpful. I think the more context we can include here the easier it will be for people to follow the guidance properly.

}),
barCounter: // ...
}

Contributor

Suggestion: flesh out the example a little bit more

Suggested change:

```go
if err := reg.Register(m.currentFoo); err != nil {
	return metrics{}, trace.Wrap(err, "registering foo_current")
}
// register other metrics...
```

#### Do

- Honour the namespace and subsystem from the `metrics.Registry`
Contributor

🇬🇧 Surprised the linter isn't complaining about honour vs honor. We must not have the localized spell checker on for RFDs.

Comment on lines +191 to +198
```go
func newService(reg *metrics.Registry) {
	go runComponentA(reg.Wrap("component_a"))
}

func newComponentA(reg *metrics.Registry) {
	m := newMetrics(reg)
	err := m.register(reg)
}
```
Contributor

I understand the point of this example but it's incomplete and leaves a lot to the reader. Would you mind fleshing this out a bit more?

#### Do

- Use `reg.Register()` to register the metric.
- Aggregate errors and fail early if you can't register metrics?
Contributor

Is this a question or guidance?

Suggested change
- Aggregate errors and fail early if you can't register metrics?
- Aggregate errors and fail early if you can't register metrics.

@hugoShaka (Contributor, Author)

I'm having conflicting opinions on this one. Do we want to:

- fail fast in case of metric conflicts (potentially causing an outage)
- continue in case of metric conflicts (losing visibility and potentially alerting)


Labels

no-changelog Indicates that a PR does not require a changelog entry rfd Request for Discussion size/md


3 participants