Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
265 changes: 265 additions & 0 deletions rfd/0197-prometheus-metrics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,265 @@
---
authors: Hugo Hervieux (hugo.hervieux@goteleport.com)
state: draft
---

# RFD 197 - Prometheus metrics guidelines

## Required Approvers

* Engineering: @codingllama && @rosstimothy && @zmb3

## What

This RFD covers the recommended way of adding metrics to Teleport and other services such as
tbot, the Teleport Kube Agent Updater, the Teleport Kubernetes Operator, or plugins
(slack, teams, pagerduty, event-handler, ...).

## Why

Every component currently relies on the global metrics registry and package-scoped metrics.
This poses several challenges:
- We pick up and serve arbitrary metrics declared by any of our dependencies. This can cause registration conflicts
(errors, flakiness, panics), high cardinality, confusing metric names, and data loss.
- Several Teleport components cannot run in the same go process without causing metric conflicts. This is the case in
integration tests (running both tbot and Teleport), or even in user-facing binaries (tctl uses `embeddedtbot`).
- The same Teleport component cannot be started multiple times without causing metric conflicts, or inaccurate metrics.
For example, integration tests starting several Teleport instances will merge their metrics together.

## Conflicts

Metrics conflicts can happen in several cases:

1. The same metric is registered with different labels by several teleport processes. This is what happens with the
grpc metrics currently as teleport and tbot register global metrics with the same name but different labels. This can
be avoided in tests by using per-process registries.
2. The same metric is registered several times by different components. This can be avoided by namespacing and
adding metrics subsystem.
3. The same metric is registered both in the local and global registries. In this case, the process registry takes
precedence, the gathering succeeds but logs errors.
4. The same component is being started and stopped several times and re-register its metrics. This is the case for
hosted plugins. We work around by creating a dedicated registry, and registering/unregistering it as a collector.
See https://github.com/prometheus/client_golang/pull/1766 for an example.

## Guidelines
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a guideline on what namespace / subsystem to use?

Should there always be a subsystem, or an empty one OK?


### Metrics Registry

#### Do

- Take the local in-process registry as an argument in your service constructor/main routine, like you would receive a
logger, and register your metrics against it.
- Pass the registry as a `*metrics.Registry`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • What is the difference between a metrics.Registry and a prometheus.Registry or prometheus.Registerer?
  • What is the reasoning for preferring the metrics.Registry?
  • How do I get a handle to a metrics.Registry?
  • How do I write tests if I'm now dependent on a pointer to a concrete metrics.Registry object? Is there a testing variant that is a noop?

Copy link
Copy Markdown
Contributor

@codingllama codingllama Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disclaimer: I'm not Hugo.

Some of these you can glimpse from the code:

// Registry is a [prometheus.Registerer] for a Teleport process that
// allows propagating additional information such as:
// - the metric namespace (`teleport`, `teleport_bot`, `teleport_plugins`)
// - an optional subsystem
//
// This should be passed anywhere that needs to register a metric.
type Registry struct {
prometheus.Registerer
namespace string
subsystem string
}

  1. A metrics.Registry wraps a prometheus.Registerer. Registerer is the interface, Registry is the concrete type. We must use the latter because of item 4.
  2. metrics.Registry has some niceties over Registerer/prom.Registry
  3. metrics.NewRegistry() or metrics.NoopRegistry()
  4. Use metrics.NoopRegistry()

Now, if you want the design to answer the question, that's a perfectly valid request IMO. (Linking to metrics.Registry might be enough?)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking to code would be helpful. I think the more context we can include here the easier it will be for people to follow the guidance properly.


```golang
func NewService(log *slog.Logger, reg *metrics.Registry) (Service, error) {
myMetric := prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: reg.Namespace(),
Subsystem: reg.Subsystem(),
Name: "my_metric",
Help: "Measures the number of foos doing bars.",
}),
if err := reg.Register(myMetric); err != nil {
return nil, trace.Wrap(err, "registering metric")
}
// ...
}
```

#### Don't
Comment thread
hugoShaka marked this conversation as resolved.

- <details>
<summary>Register against the global prometheus registry</summary>

```golang
func NewService(log *slog.Logger) (Service, error) {
myMetric := prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: teleport.MetricNamespace,
Subsystem: metricsSubsystem,
Name: "my_metric",
Help: "Measures the number of foos doing bars.",
}),
if err := prometheus.Register(myMetric); err != nil {
return nil, trace.Wrap(err, "registering metric")
}
// ...
}
```
</details>
- <details>
<summary>Pass the registry as a `*prometheus.Registry` or `prometheus.Registerer`</summary>

```golang
func NewService(log *slog.Logger, reg *prometheus.Registry) (Service, error) {
myMetric := prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: teleport.MetricNamespace,
Subsystem: metricsSubsystem,
Name: "my_metric",
Help: "Measures the number of foos doing bars.",
}),
if err := reg.Register(myMetric); err != nil {
return nil, trace.Wrap(err, "registering metric")
}
// ...
}
```
</details>

### Storing metrics

#### Do

- Store metrics in a private struct in your package
Comment thread
hugoShaka marked this conversation as resolved.
Comment thread
hugoShaka marked this conversation as resolved.

```golang
type fooMetrics struct {
currentFoo prometheus.Gauge
barCounter *prometheus.CounterVec
}

type fooService struct {
metrics *fooMetrics
}

func newMetrics(reg *metrics.Registry) (metrics, error) {
m := metrics{
currentFoo: prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: teleport.MetricNamespace,
Subsystem: metricsSubsystem,
Name: "foo_current",
Help: "Measures the number of foos.",
}),
barCounter: // ...
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: flesh out the example a little bit more

Suggested change
if err := reg.Register(m.currentFoo); err != nil {
return metrics{}, trace.Wrap(err, "registering foo_current")
}
// register other metrics...

return &m
}

```

#### Don't

- <details>
<summary>Store metrics in a package-scoped variable</summary>

```golang
var (
currentFoo = prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: teleport.MetricNamespace,
Subsystem: metricsSubsystem,
Name: "foo_current",
Help: "Measures the number of foos.",
})
// ...
)
```
</details>

### Naming metrics
Comment thread
hugoShaka marked this conversation as resolved.

#### Do

- Honour the namespace and subsystem from the `metrics.Registry`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🇬🇧 Surprised the linter isn't complaining about honour vs honor. We must not have the localized spell checker on for RFDs.

- Wrap the `metrics.Registry` to add component-level information to the metrics subsystem.
- Follow [the prometheus metrics naming guidelines](https://prometheus.io/docs/practices/naming/),
especially always specify the unit and use a suffix to clarify the metric type (`_total`, `_info`).

```golang
type metrics struct {
currentFoo prometheus.Gauge
barCounter *prometheus.CounterVec
}

func newMetrics(reg *metrics.Registry) (*metrics, error) {
m := metrics{
currentFoo: prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: teleport.MetricNamespace,
Subsystem: metricsSubsystem,
Name: "foo_timestamp_seconds",
Help: "Represents the foo time, in seconds.",
}),
barCounter: prometheus.NewCounter(prometheus.GaugeOpts{
Namespace: teleport.MetricNamespace,
Subsystem: metricsSubsystem,
Name: "bar_total",
Help: "Number of times bar happened.",
}),
}
// ...
}

func newService(reg *metrics.Registry) {
go runComponentA(reg.Wrap("component_a"))
}

func newComponentA(reg *metrics.Registry) {
m := newMetrics(reg)
err := m.register(reg)
}
Comment on lines +191 to +198
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the point of this example but it's incomplete and leaves a lot to the reader. Would you mind fleshing this out a bit more?

```

#### Don't

- Manually namespace metrics
<details>
<summary>Manually namespace metrics or create non-namespaced metrics</summary>

```golang
type metrics struct {
currentFoo prometheus.Gauge
barCounter *prometheus.CounterVec
}

func newMetrics(reg prometheus.Registerer) (*metrics, error) {
m := metrics{
currentFoo: prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: "teleport"
Name: "foo_timestamp_seconds",
Help: "Represents the foo time, in seconds.",
}),
barCounter: prometheus.NewCounter(prometheus.GaugeOpts{
Name: "bar_total",
Help: "Number of times bar happened.",
}),
}
// ...
}
```
</details>

### Metric registration

#### Do

- Use `reg.Register()` to register the metric.
- Aggregate errors and fail early if you can't register metrics?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a question or guidance?

Suggested change
- Aggregate errors and fail early if you can't register metrics?
- Aggregate errors and fail early if you can't register metrics.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having conflicting opinions on this one. Do we want to:

  • fail fast in case of metric conflicts (potentially causing an outage)
  • continue in case of metric conflicts (losing visibility and potentially alerting)


#### Don't

- Use the package helpers `prometheus.Register()` or `prometheus.MustRegister`.
- Use `reg.MustRegister` as it panics in case of conflict and doesn't clearly indicate which metric descriptor conflicted.

### Labeling

#### Do

- Ensure no secret information is present in the labels.
- Avoid high cardinality labels

#### Don't

- Put user input directly in the labels
- Create large (1k+) metric combinations
- Create metrics with an ever growing number of labels

Comment thread
hugoShaka marked this conversation as resolved.
## Enforcing the guidelines

Some guidelines can be enforced by setting up linters:
- [promlinter](https://golangci-lint.run/usage/linters/#promlinter) to ensure that metric naming and labeling follows
the Prometheus guidelines. Note: we might not be able to use the strict mode as namespaces and subsystems are
passed from the caller.
- [forbidigo](https://golangci-lint.run/usage/linters/#forbidigo) to reject usages of `prometheus.DefaultRegisterer`,
`prometheus.(Must)Register`. `reg.MustRegister` conflicts with `backend.MustRegister`, we might not be able to detect
it (`forbidigo.analyze-types` might not be sufficient)

Existing non-compliant metrics and edge case usages will be allowed via `//nolint` comments.
4 changes: 4 additions & 0 deletions rfd/cspell.json
Original file line number Diff line number Diff line change
Expand Up @@ -223,6 +223,7 @@
"eksctl",
"Elastisearch",
"elif",
"embeddedtbot",
"endpointslices",
"enduml",
"Enes",
Expand Down Expand Up @@ -260,6 +261,7 @@
"Foos",
"footgun",
"foov",
"forbidigo",
"Fprintln",
"Frazar",
"fspmarshall",
Expand Down Expand Up @@ -490,6 +492,7 @@
"nklaassen",
"noahstride",
"nodename",
"nolint",
"nologin",
"NOPASSWD",
"noup",
Expand Down Expand Up @@ -596,6 +599,7 @@
"proce",
"producting",
"programmerq",
"promlinter",
"Proto",
"protobuf",
"protobufs",
Expand Down
Loading