-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: Add update_aggregate_labels() #1966
base: master
Are you sure you want to change the base?
metrics: Add update_aggregate_labels() #1966
Conversation
Extends the metrics api to allow changing the aggregation labels of a metrics family. Otherwise one had to un-register every single metric instance in a metric family and then re-register with the changed aggregation labels. For metric families with thousands of instances (e.g.: histograms with lots of different labels) this is quite expensive. With this change we avoid the full reconstruction of the metrics family and all its metrics. Only the work associated with marking the metrics `dirty()` is needed then.
First some of the tests are failing, second, marking the metric with dirty will in fact recreate all the metadata, it's not saving anything, dirty happens on the next call to get the metrics values. The idea with dirty is that on startup we usually register lots of metrics, to avoid recreating the meta data object on each metric registration (a quadratic behavior) we delay it. If done properly, changing the aggregation label could avoid the dirty all together. What is the use case of changing the aggregation labels in the first place? |
One more thing, if there is a use case to change the aggregate labels (I would like to see), it would be best if we can add it the relabel_config |
The failing tests are #1913
Understand, maybe I am missing your point but we will have to recreate it once the aggregation labels are changed?
I don't immediately see how that would work? You mean iterating over the queue instead and updating in place or something? I guess that would be a slight perf improvement.
The tradeof with aggregation is that you lose information at the cost having less metrics series which reduces load/costs on your metrics infra. In redpanda we generally have two modes, one that adds lots of detailed labels and gives you the full info while the second aggregates lots of those labels away as metrics cardinality would be way too high otherwise which would make metrics infra costs sky rocket. The former is generally fine but during incidents having the full info is useful. Hence this change to cheaply update the aggregation labels.
I don't immediately see how this would be beneficial. The whole label relabing part is currently unrelated to the aggregation labels. They only come in at scrape time. Hence the API as added in this PR is a lot simpler. |
I'm saying that if you call dirty() the entire meta data will need to be re-created.
I'm saying that aggregation label are only applicable on reporting, so you don't need to rebuild the meta-data
How do you see that it's going to work? In practice, in run-time, how are you going to change that?
It might be simpler, but you just add an API to the metric layer that is disconnected from the API we already have and it's unclear to me how it's going to be used. |
The metadata is used when reporting though so if metadata is not rebuilt then it will use some outdated aggregation labels? This is also something I was running into initially when I was not setting the dirty flag.
So how we use this in redpanda is that we have registry for each metric which tracks the aggregation labels for either state of our "aggregate metrics" config flag being true or false. When the config flag is changed at runtime we call Previously the config flag was just checked on startup/metric creation time. You can see the specific code here: https://github.com/redpanda-data/redpanda/blob/dev/src/v/metrics/metrics_registry.cc#L55-L65
I really don't see how the relable_config API is related. While one could certainly shoehorn this in the relabel config API it would be inefficient at that. It has something like O(S*R) complexity (S == metric series, R == relable configs) with very high per OP cost because of the regex match. Further it would require a separate relabel config for each metric family. The new API is O(1) as it really just concerns itself with a single metric family. Conceptually I also think of relabel config happening on a layer above metric aggregation. |
All metrics with their data are stored in Changing the aggregation labels, does not require to change the structure of the Another alternative (which is probably even better) is to remove the duplication and have
How do you make sure that it runs over all shards? |
Looks reasonable to me, but will wait for @amnonh to approve. |
You can rebase to clear the CI failures. |
Ah I think I was misunderstanding you here. So what you are saying is that we shouldn't update the whole of This of course makes sense and I will have a look at some of your suggestions.
Each shard has to run it locally (similar to how each shard registers metrics separately).
Yes this is intentional though. Each metric family has its own aggregation labels so changing multiple at the same time doesn't really make sense. I can certainly see where something like the relabel_config approach is useful but the API as added really just serves as a counterpart to updating some of the fields that were previously specified when registering the metrics. |
Had a look at partially updating It's actually a bit more tricky than I thought as the whole thing is shared across shards via a shared_ptr so we can't easily change anything in there without extra protection. We could simplify Possibly we could do something a bit more tricky like doing a best effort update by checking |
This series adds metrics family config. Family config is a configuration that relates to all metrics with the same name but with different labels. set_metrics_family_configs allows changing that configuration during run time. Specifically, change the label aggregation based on a metric name. Using multiple independent labels on the same metric is a common practice. A known drawback is that this cartesian product makes it easy to discover in real-time that the number of metrics explodes. While metrics relabel config allows dropping some of those metrics, it could be useful to report an aggregated metric. The following is an example for setting the aggregate labels for the metric test_gauge_1 and all metrics matching the regex test_gauge1.* ``` std::vector<sm::metrics_family_config> fc(2); fc[0].name = "test_gauge_1"; fc[0].aggregate_labels = { "lb" }; fc[1].regex_name = "test_gauge1.*"; fc[1].aggregate_labels = { "ll", "aa" }; sm::set_metrics_family_configs(fc); ``` An alternative to #1966 Closes #2121 * github.com:scylladb/seastar: tests/unit/metrics_test.cc: Test metrics family config metrics: Adds metric_family_config support core/relabel_config.hh: support metric family config metrics_api.hh: Remove duplication of metric family metadata
This series adds metrics family config. Family config is a configuration that relates to all metrics with the same name but with different labels. set_metrics_family_configs allows changing that configuration during run time. Specifically, change the label aggregation based on a metric name. Using multiple independent labels on the same metric is a common practice. A known drawback is that this cartesian product makes it easy to discover in real-time that the number of metrics explodes. While metrics relabel config allows dropping some of those metrics, it could be useful to report an aggregated metric. The following is an example for setting the aggregate labels for the metric test_gauge_1 and all metrics matching the regex test_gauge1.* ``` std::vector<sm::metrics_family_config> fc(2); fc[0].name = "test_gauge_1"; fc[0].aggregate_labels = { "lb" }; fc[1].regex_name = "test_gauge1.*"; fc[1].aggregate_labels = { "ll", "aa" }; sm::set_metrics_family_configs(fc); ``` An alternative to scylladb#1966 Closes scylladb#2121 * github.com:scylladb/seastar: tests/unit/metrics_test.cc: Test metrics family config metrics: Adds metric_family_config support core/relabel_config.hh: support metric family config metrics_api.hh: Remove duplication of metric family metadata
Extends the metrics api to allow changing the aggregation labels of a
metrics family.
Otherwise one had to un-register every single metric instance in a
metric family and then re-register with the changed aggregation labels.
For metric families with thousands of instances (e.g.: histograms with
lots of different labels) this is quite expensive.
With this change we avoid the full reconstruction of the metrics family
and all its metrics. Only the work associated with marking the metrics
dirty()
is needed then.