
RFD 197 - Prometheus metrics guidelines#51139

Open
hugoShaka wants to merge 7 commits into master from rfd/0197-prometheus-metrics

Conversation

@hugoShaka (Contributor)

Following the addition of the internal metric registry in Teleport, here's a mini RFD about how to add metrics in Teleport.

Rendered version.

@github-actions github-actions bot requested review from atburke and tigrato January 16, 2025 21:37
@github-actions github-actions bot added rfd Request for Discussion size/md labels Jan 16, 2025
Comment thread rfd/0197-prometheus-metrics.md Outdated
@rosstimothy (Contributor) left a comment

Are there any gotchas in converting existing metrics from the global registry to a local registry?

Comment thread rfd/0197-prometheus-metrics.md Outdated

#### Do

- Use `teleport.MetricsNamespace` as the namespace.
Contributor

Do we have any strategies for migrating legacy metrics which do not have a namespace? Should we register the same metric with and without the namespace?

@hugoShaka (Contributor, Author) Jan 16, 2025

I have not suggested a migration strategy because this would be a pretty disruptive change. I don't like double-registering metrics very much because it increases cardinality.

I guess we could pull a metric breaking change in a major version.

Contributor

Would it be terrible to double register the same metric in major version N and announce that the non-namespaced variants will be removed in version N+2? That would give people ~8 months of notice to adjust to use the correct metrics.

@hugoShaka (Contributor, Author)

I think that's acceptable, maybe we can gate the behaviour behind a TELEPORT_UNSTABLE_ env var or something, so if someone has an issue with this they can disable the duplication and use only the new or old metrics.

Contributor

I think it's worth checking which metrics would be affected and how many labels they have, it could be that the problem is small-ish.

Do we have metrics in the "wrong" namespaces that we would like to fix too?


@hugoShaka hugoShaka force-pushed the rfd/0197-prometheus-metrics branch from 52d7674 to 164e16e Compare January 16, 2025 22:34
@codingllama (Contributor) left a comment

Thanks, Hugo!

LGTM.

rosstimothy added a commit that referenced this pull request Jun 25, 2025
During the cache storage refactoring, all of the cache metrics
that were recorded via the backend.Reporter that wrapped the
cache backend were lost. This attempts to restore the metrics
as close to the original meaning as possible, though due to the
way the cache now works, some of the metrics don't make sense.
The biggest change is in the top requests metric, which has
historically been based on the backend.Key; since that key is
not present in the cache, the metric is now labeled by resource
kind. All existing dashboards and tctl top should still work
as expected.

The cache metrics were also refactored to honor best practices
laid out in [RFD 197](#51139).
@hugoShaka hugoShaka force-pushed the rfd/0197-prometheus-metrics branch from 92a08ed to 268976a Compare November 11, 2025 21:53
@hugoShaka hugoShaka added the no-changelog Indicates that a PR does not require a changelog entry label Nov 11, 2025
@hugoShaka (Contributor, Author)

@codingllama is back (🎉) and I'm dealing with metrics problems now, so I resurrected the PR based on changes from #61239 and addressed your outstanding comments.

@codingllama (Contributor) left a comment

LGTM.

I'm less "in the know" this time around, but it seems reasonable and a straight-up improvement over what we have.

Comment thread rfd/0197-prometheus-metrics.md Outdated
hosted plugins. We work around this by creating a dedicated registry and registering/unregistering it as a collector.
See https://github.com/prometheus/client_golang/pull/1766 for an example.

## Guidelines
Contributor

Should we add a guideline on what namespace / subsystem to use?

Should there always be a subsystem, or is an empty one OK?

Co-authored-by: Alan Parra <12500300+codingllama@users.noreply.github.com>

- Take the local in-process registry as an argument in your service constructor/main routine, like you would receive a
logger, and register your metrics against it.
- Pass the registry as a `*metrics.Registry`
Contributor

- What is the difference between a metrics.Registry and a prometheus.Registry or prometheus.Registerer?
- What is the reasoning for preferring the metrics.Registry?
- How do I get a handle to a metrics.Registry?
- How do I write tests if I'm now dependent on a pointer to a concrete metrics.Registry object? Is there a testing variant that is a noop?

@codingllama (Contributor) Nov 18, 2025

Disclaimer: I'm not Hugo.

Some of these you can glimpse from the code:

```go
// Registry is a [prometheus.Registerer] for a Teleport process that
// allows propagating additional information such as:
// - the metric namespace (`teleport`, `teleport_bot`, `teleport_plugins`)
// - an optional subsystem
//
// This should be passed anywhere that needs to register a metric.
type Registry struct {
	prometheus.Registerer
	namespace string
	subsystem string
}
```

1. A metrics.Registry wraps a prometheus.Registerer. Registerer is the interface, Registry is the concrete type. We must use the latter because of item 4.
2. metrics.Registry has some niceties over Registerer/prom.Registry.
3. metrics.NewRegistry() or metrics.NoopRegistry().
4. Use metrics.NoopRegistry().

Now, if you want the design to answer the question, that's a perfectly valid request IMO. (Linking to metrics.Registry might be enough?)

Contributor

Linking to code would be helpful. I think the more context we can include here the easier it will be for people to follow the guidance properly.

}),
barCounter: // ...
}

Contributor

Suggestion: flesh out the example a little bit more

Suggested change:

```go
if err := reg.Register(m.currentFoo); err != nil {
	return metrics{}, trace.Wrap(err, "registering foo_current")
}
// register other metrics...
```

#### Do

- Honour the namespace and subsystem from the `metrics.Registry`
Contributor

🇬🇧 Surprised the linter isn't complaining about honour vs honor. We must not have the localized spell checker on for RFDs.

Comment on lines +191 to +198
```go
func newService(reg *metrics.Registry) {
	go runComponentA(reg.Wrap("component_a"))
}

func newComponentA(reg *metrics.Registry) {
	m := newMetrics(reg)
	err := m.register(reg)
}
```
Contributor

I understand the point of this example but it's incomplete and leaves a lot to the reader. Would you mind fleshing this out a bit more?

#### Do

- Use `reg.Register()` to register the metric.
- Aggregate errors and fail early if you can't register metrics?
Contributor

Is this a question or guidance?

Suggested change
- Aggregate errors and fail early if you can't register metrics?
- Aggregate errors and fail early if you can't register metrics.

@hugoShaka (Contributor, Author)

I'm having conflicting opinions on this one. Do we want to:

- fail fast in case of metric conflicts (potentially causing an outage)
- continue in case of metric conflicts (losing visibility and potentially alerting)


Labels

no-changelog Indicates that a PR does not require a changelog entry rfd Request for Discussion size/md


3 participants