Conversation
rosstimothy
left a comment
There was a problem hiding this comment.
Are there any gotchas in converting existing metrics from the global registry to a local registry?
#### Do

- Use `teleport.MetricsNamespace` as the namespace.
There was a problem hiding this comment.
Do we have any strategies for migrating legacy metrics which do not have a namespace? Should we register the same metric with and without the namespace?
There was a problem hiding this comment.
I haven't suggested a migration strategy so far because this is a pretty disruptive change. I'm not keen on double-registering metrics because it increases cardinality.
I guess we could ship a breaking metrics change in a major version.
There was a problem hiding this comment.
Would it be terrible to double register the same metric in major version N and announce that the non-namespaced variants will be removed in version N+2? That would give people ~8 months of notice to adjust to use the correct metrics.
There was a problem hiding this comment.
I think that's acceptable, maybe we can gate the behaviour behind a TELEPORT_UNSTABLE_ env var or something, so if someone has an issue with this they can disable the duplication and use only the new or old metrics.
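To make the double-registration idea concrete, here is a minimal sketch of how the set of names for one legacy metric could be derived during the deprecation window. Everything in it is an assumption for illustration: the `metricNames` helper, the `dualRegister` flag (standing in for whatever opt-out switch ends up being chosen, such as an environment variable), the `"teleport"` namespace string, and the `example_requests_total` metric name are all hypothetical.

```go
package main

import "fmt"

// metricNames returns the names a legacy, non-namespaced metric would be
// exposed under during the deprecation window.
//
// dualRegister is a hypothetical stand-in for the opt-out switch discussed
// above; how it is read (env var, flag, config) is deliberately left open.
func metricNames(namespace, legacyName string, dualRegister bool) []string {
	// The namespaced variant is always present.
	names := []string{namespace + "_" + legacyName}
	if dualRegister {
		// Keep the legacy name alive until it is removed in a later major version.
		names = append(names, legacyName)
	}
	return names
}

func main() {
	// Hypothetical namespace and metric name, for illustration only.
	fmt.Println(metricNames("teleport", "example_requests_total", true))
	fmt.Println(metricNames("teleport", "example_requests_total", false))
}
```

Note that with the switch on, every affected metric is exported twice, which is the cardinality cost raised earlier in this thread.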
There was a problem hiding this comment.
I think it's worth checking which metrics would be affected and how many labels they have; the problem may turn out to be small-ish.
Do we have metrics in the "wrong" namespaces that we would like to fix too?
52d7674 to
164e16e
Compare
During the cache storage refactoring, all of the cache metrics that were recorded via the backend.Reporter wrapping the cache backend were lost. This change attempts to restore the metrics as close to their original meaning as possible, though some of them no longer make sense given the way the cache now works. The biggest change is in the top requests metric, which has historically been based on the backend.Key; since that key is not present in the cache, the metric is now labeled by resource kind. All existing dashboards and tctl top should still work as expected. The cache metrics were also refactored to honor the best practices laid out in [RFD 197](#51139).
92a08ed to
268976a
Compare
@codingllama is back (🎉) and I'm dealing with metrics problems now, so I resurrected the PR based on changes from #61239 and addressed your outstanding comments.
codingllama
left a comment
There was a problem hiding this comment.
LGTM.
I'm less "in the know" this time around, but it seems reasonable and a straight up improvement to what we have.
hosted plugins. We work around this by creating a dedicated registry, and registering/unregistering it as a collector.
See https://github.com/prometheus/client_golang/pull/1766 for an example.

## Guidelines
There was a problem hiding this comment.
Should we add a guideline on what namespace / subsystem to use?
Should there always be a subsystem, or is an empty one OK?
Co-authored-by: Alan Parra <12500300+codingllama@users.noreply.github.com>
- Take the local in-process registry as an argument in your service constructor/main routine, like you would receive a logger, and register your metrics against it.
- Pass the registry as a `*metrics.Registry`
There was a problem hiding this comment.
1. What is the difference between a metrics.Registry and a prometheus.Registry or prometheus.Registerer?
2. What is the reasoning for preferring the metrics.Registry?
3. How do I get a handle to a metrics.Registry?
4. How do I write tests if I'm now dependent on a pointer to a concrete metrics.Registry object? Is there a testing variant that is a noop?
There was a problem hiding this comment.
Disclaimer: I'm not Hugo.
Some of these you can glimpse from the code:
teleport/lib/observability/metrics/registry.go
Lines 28 to 39 in e56a044
1. A metrics.Registry wraps a prometheus.Registerer. Registerer is the interface, Registry is the concrete type. We must use the latter because of item 4.
2. metrics.Registry has some niceties over Registerer/prom.Registry
3. metrics.NewRegistry() or metrics.NoopRegistry()
4. Use metrics.NoopRegistry()
Now, if you want the design to answer the question, that's a perfectly valid request IMO. (Linking to metrics.Registry might be enough?)
There was a problem hiding this comment.
Linking to code would be helpful. I think the more context we can include here the easier it will be for people to follow the guidance properly.
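As a rough illustration of the answers above, the sketch below models the relationship between an interface-style registerer and a concrete registry that carries a namespace and subsystem, plus a noop variant for tests. It is a simplified stand-in built on the standard library only, not the real `metrics.Registry`: the names `NewRegistry`, `NoopRegistry`, and `Wrap` come from this thread, but their signatures here, the name-based `Register`, and the `FullName` helper are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// registerer is a stand-in for the prometheus.Registerer interface,
// reduced to name-based registration for illustration.
type registerer interface {
	Register(name string) error
}

// memRegisterer records names and rejects duplicates.
type memRegisterer struct{ names map[string]bool }

func (m *memRegisterer) Register(name string) error {
	if m.names[name] {
		return errors.New("duplicate metric: " + name)
	}
	m.names[name] = true
	return nil
}

// noopRegisterer accepts everything and stores nothing.
type noopRegisterer struct{}

func (noopRegisterer) Register(string) error { return nil }

// Registry is a hypothetical concrete wrapper: it embeds the interface and
// carries the namespace and subsystem that metrics are expected to honour.
type Registry struct {
	registerer
	namespace, subsystem string
}

// NewRegistry builds a registry backed by an in-memory registerer.
func NewRegistry(namespace string) *Registry {
	return &Registry{registerer: &memRegisterer{names: map[string]bool{}}, namespace: namespace}
}

// NoopRegistry returns a registry that never fails, intended for tests.
func NoopRegistry() *Registry { return &Registry{registerer: noopRegisterer{}} }

// Wrap returns a child registry sharing the same backend, with a subsystem set.
func (r *Registry) Wrap(subsystem string) *Registry {
	return &Registry{registerer: r.registerer, namespace: r.namespace, subsystem: subsystem}
}

// FullName joins namespace, subsystem, and name with underscores, skipping empty parts.
func (r *Registry) FullName(name string) string {
	full := ""
	for _, part := range []string{r.namespace, r.subsystem, name} {
		if part == "" {
			continue
		}
		if full != "" {
			full += "_"
		}
		full += part
	}
	return full
}

// newComponent takes the registry as an argument, like it would a logger.
func newComponent(reg *Registry) error {
	return reg.Register(reg.FullName("foo_current"))
}

func main() {
	root := NewRegistry("teleport") // hypothetical namespace value
	fmt.Println(newComponent(root.Wrap("component_a")))
	fmt.Println(newComponent(NoopRegistry()))
}
```

In a test, a component built this way could be constructed with `NoopRegistry()` repeatedly without tripping duplicate-registration errors, which is the property question 4 asks about.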
		}),
		barCounter: // ...
	}
There was a problem hiding this comment.
Suggestion: flesh out the example a little bit more
if err := reg.Register(m.currentFoo); err != nil {
	return metrics{}, trace.Wrap(err, "registering foo_current")
}
// register other metrics...

#### Do

- Honour the namespace and subsystem from the `metrics.Registry`
There was a problem hiding this comment.
🇬🇧 Surprised the linter isn't complaining about honour vs honor. We must not have the localized spell checker on for RFDs.
func newService(reg *metrics.Registry) {
	go runComponentA(reg.Wrap("component_a"))
}

func newComponentA(reg *metrics.Registry) {
	m := newMetrics(reg)
	err := m.register(reg)
}
There was a problem hiding this comment.
I understand the point of this example but it's incomplete and leaves a lot to the reader. Would you mind fleshing this out a bit more?
#### Do

- Use `reg.Register()` to register the metric.
- Aggregate errors and fail early if you can't register metrics?
There was a problem hiding this comment.
Is this a question or guidance?
| - Aggregate errors and fail early if you can't register metrics? | |
| - Aggregate errors and fail early if you can't register metrics. |
There was a problem hiding this comment.
I have conflicting opinions on this one. Do we want to:
- fail fast in case of metric conflicts (potentially causing an outage)
- continue in case of metric conflicts (losing visibility and potentially alerting)
Following the addition of the internal metric registry in Teleport, here's a mini RFD about how to add metrics in Teleport.
Rendered version.