Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: Add update_aggregate_labels() #1966

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

StephanDollberg
Copy link
Contributor

Extends the metrics api to allow changing the aggregation labels of a
metrics family.

Otherwise one had to un-register every single metric instance in a
metric family and then re-register with the changed aggregation labels.

For metric families with thousands of instances (e.g.: histograms with
lots of different labels) this is quite expensive.

With this change we avoid the full reconstruction of the metrics family
and all its metrics. Only the work associated with marking the metrics
dirty() is needed then.

Extends the metrics api to allow changing the aggregation labels of a
metrics family.

Otherwise one had to un-register every single metric instance in a
metric family and then re-register with the changed aggregation labels.

For metric families with thousands of instances (e.g.: histograms with
lots of different labels) this is quite expensive.

With this change we avoid the full reconstruction of the metrics family
and all its metrics. Only the work associated with marking the metrics
`dirty()` is needed then.
@xemul xemul requested a review from amnonh November 28, 2023 07:37
@amnonh
Copy link
Contributor

amnonh commented Nov 28, 2023

First some of the tests are failing, second, marking the metric with dirty will in fact recreate all the metadata, it's not saving anything, dirty happens on the next call to get the metrics values.

The idea with dirty is that on startup we usually register lots of metrics, to avoid recreating the meta data object on each metric registration (a quadratic behavior) we delay it.

If done properly, changing the aggregation label could avoid the dirty all together.

What is the use case of changing the aggregation labels in the first place?

@amnonh
Copy link
Contributor

amnonh commented Nov 28, 2023

One more thing, if there is a use case to change the aggregate labels (I would like to see), it would be best if we can add it the relabel_config

@StephanDollberg
Copy link
Contributor Author

First some of the tests are failing, second, marking the metric with dirty will in fact recreate all the metadata, it's not saving anything, dirty happens on the next call to get the metrics values.

The failing tests are #1913

The idea with dirty is that on startup we usually register lots of metrics, to avoid recreating the meta data object on each metric registration (a quadratic behavior) we delay it.

Understand, maybe I am missing your point but we will have to recreate it once the aggregation labels are changed?

If done properly, changing the aggregation label could avoid the dirty all together.

I don't immediately see how that would work? You mean iterating over the queue instead and updating in place or something? I guess that would be a slight perf improvement.

What is the use case of changing the aggregation labels in the first place?

The tradeof with aggregation is that you lose information at the cost having less metrics series which reduces load/costs on your metrics infra.

In redpanda we generally have two modes, one that adds lots of detailed labels and gives you the full info while the second aggregates lots of those labels away as metrics cardinality would be way too high otherwise which would make metrics infra costs sky rocket.

The former is generally fine but during incidents having the full info is useful. Hence this change to cheaply update the aggregation labels.

One more thing, if there is a use case to change the aggregate labels (I would like to see), it would be best if we can add it the relabel_config

I don't immediately see how this would be beneficial. The whole label relabing part is currently unrelated to the aggregation labels. They only come in at scrape time. Hence the API as added in this PR is a lot simpler.

@amnonh
Copy link
Contributor

amnonh commented Nov 28, 2023

First some of the tests are failing, second, marking the metric with dirty will in fact recreate all the metadata, it's not saving anything, dirty happens on the next call to get the metrics values.

The failing tests are #1913

The idea with dirty is that on startup we usually register lots of metrics, to avoid recreating the meta data object on each metric registration (a quadratic behavior) we delay it.

Understand, maybe I am missing your point but we will have to recreate it once the aggregation labels are changed?

I'm saying that if you call dirty() the entire meta data will need to be re-created.

If done properly, changing the aggregation label could avoid the dirty all together.

I don't immediately see how that would work? You mean iterating over the queue instead and updating in place or something? I guess that would be a slight perf improvement.

I'm saying that aggregation label are only applicable on reporting, so you don't need to rebuild the meta-data

What is the use case of changing the aggregation labels in the first place?

The tradeof with aggregation is that you lose information at the cost having less metrics series which reduces load/costs on your metrics infra.

In redpanda we generally have two modes, one that adds lots of detailed labels and gives you the full info while the second aggregates lots of those labels away as metrics cardinality would be way too high otherwise which would make metrics infra costs sky rocket.

The former is generally fine but during incidents having the full info is useful. Hence this change to cheaply update the aggregation labels.

How do you see that it's going to work? In practice, in run-time, how are you going to change that?

One more thing, if there is a use case to change the aggregate labels (I would like to see), it would be best if we can add it the relabel_config

I don't immediately see how this would be beneficial. The whole label relabing part is currently unrelated to the aggregation labels. They only come in at scrape time. Hence the API as added in this PR is a lot simpler.

It might be simpler, but you just add an API to the metric layer that is disconnected from the API we already have and it's unclear to me how it's going to be used.

@StephanDollberg
Copy link
Contributor Author

StephanDollberg commented Nov 28, 2023

I'm saying that if you call dirty() the entire meta data will need to be re-created.

I'm saying that aggregation label are only applicable on reporting, so you don't need to rebuild the meta-data

The metadata is used when reporting though so if metadata is not rebuilt then it will use some outdated aggregation labels? This is also something I was running into initially when I was not setting the dirty flag.

How do you see that it's going to work? In practice, in run-time, how are you going to change that?

So how we use this in redpanda is that we have registry for each metric which tracks the aggregation labels for either state of our "aggregate metrics" config flag being true or false. When the config flag is changed at runtime we call update_aggregate_labels for each metric family with the aggregation labels that we then want to use.

Previously the config flag was just checked on startup/metric creation time.

You can see the specific code here: https://github.com/redpanda-data/redpanda/blob/dev/src/v/metrics/metrics_registry.cc#L55-L65

It might be simpler, but you just add an API to the metric layer that is disconnected from the API we already have and it's unclear to me how it's going to be used.

I really don't see how the relable_config API is related. set_relabel_configs applies the relabel config to all metric families and metric series. This can be useful if one wants to change a label on all series (for example shard) but not if wanting to change the aggregation label for one specific metric family.

While one could certainly shoehorn this in the relabel config API it would be inefficient at that. It has something like O(S*R) complexity (S == metric series, R == relable configs) with very high per OP cost because of the regex match. Further it would require a separate relabel config for each metric family. The new API is O(1) as it really just concerns itself with a single metric family.

Conceptually I also think of relabel config happening on a layer above metric aggregation.

@amnonh
Copy link
Contributor

amnonh commented Nov 29, 2023

The metadata is used when reporting though so if metadata is not rebuilt then it will use some outdated aggregation labels? This is also something I was running into initially when I was not setting the dirty flag.

All metrics with their data are stored in _value_map, a subset of the metrics (those that are enabled) are stored in _metadata which is updated when you set _dirty to true.

Changing the aggregation labels, does not require to change the structure of the _metadata you can update the labels directly and save the rebuild of the structure (which is expensive).

Another alternative (which is probably even better) is to remove the duplication and have _metadata hold a shard_ptr pointing to the _value_map, it would cost another redirection but it would save memory and time when creating the copy.

How do you see that it's going to work? In practice, in run-time, how are you going to change that?

So how we use this in redpanda is that we have registry for each metric which tracks the aggregation labels for either state of our "aggregate metrics" config flag being true or false. When the config flag is changed at runtime we call update_aggregate_labels for each metric family with the aggregation labels that we then want to use.

Previously the config flag was just checked on startup/metric creation time.

You can see the specific code here: https://github.com/redpanda-data/redpanda/blob/dev/src/v/metrics/metrics_registry.cc#L55-L65

How do you make sure that it runs over all shards?
Your implementation is very specific for a single metric, this functionality will be useful for multiple metrics, for example, you can use regular expression or prefix matching for the full metric name.

@avikivity
Copy link
Member

Looks reasonable to me, but will wait for @amnonh to approve.

@avikivity
Copy link
Member

You can rebase to clear the CI failures.

@StephanDollberg
Copy link
Contributor Author

Changing the aggregation labels, does not require to change the structure of the _metadata you can update the labels directly and save the rebuild of the structure (which is expensive).

Another alternative (which is probably even better) is to remove the duplication and have _metadata hold a shard_ptr pointing to the _value_map, it would cost another redirection but it would save memory and time when creating the copy.

Ah I think I was misunderstanding you here. So what you are saying is that we shouldn't update the whole of _metadata (as is done by marking dirty==true) but just do a targeted update of the aggregation labels?

This of course makes sense and I will have a look at some of your suggestions.

How do you make sure that it runs over all shards?

Each shard has to run it locally (similar to how each shard registers metrics separately).

Your implementation is very specific for a single metric, this functionality will be useful for multiple metrics, for example, you can use regular expression or prefix matching for the full metric name.

Yes this is intentional though. Each metric family has its own aggregation labels so changing multiple at the same time doesn't really make sense.

I can certainly see where something like the relabel_config approach is useful but the API as added really just serves as a counterpart to updating some of the fields that were previously specified when registering the metrics.

@StephanDollberg
Copy link
Contributor Author

Had a look at partially updating _metadata.

It's actually a bit more tricky than I thought as the whole thing is shared across shards via a shared_ptr so we can't easily change anything in there without extra protection.

We could simplify update_metrics_if_needed to only do "half" of the update (i.e.: only update _metadata and not _current_metrics) but I don't actually think this will save much as we would still need to rebuild lots of the structures. So probably not worth the effort.

Possibly we could do something a bit more tricky like doing a best effort update by checking _metadata.use_count() in update_metrics_if_needed and then update the individual members in _metadata directly if needed. I guess in practice this should work out fine given how the call chains work and users of get_values will like have dropped their shared_ptr before calling get_values the next time (which will then trigger the update). I can get up a patch like that if you think it's worth persuing.

nyh added a commit that referenced this pull request Mar 20, 2024
This series adds metrics family config.

Family config is a configuration that relates to all metrics with the same name but with different labels.

set_metrics_family_configs allows changing that configuration during run time. Specifically, change the label aggregation based on a metric name.

Using multiple independent labels on the same metric is a common practice. A known drawback is that this cartesian product makes it easy to discover in real-time that the number of metrics explodes.
While metrics relabel config allows dropping some of those metrics, it could be useful to report an aggregated metric.

The following is an example for setting the aggregate labels for the metric test_gauge_1 and all metrics matching the regex test_gauge1.*

```
std::vector<sm::metrics_family_config> fc(2);
fc[0].name = "test_gauge_1";
fc[0].aggregate_labels = { "lb" };
fc[1].regex_name = "test_gauge1.*";
fc[1].aggregate_labels = { "ll", "aa" };

sm::set_metrics_family_configs(fc);
```

An alternative to #1966

Closes #2121

* github.com:scylladb/seastar:
  tests/unit/metrics_test.cc: Test metrics family config
  metrics: Adds metric_family_config support
  core/relabel_config.hh: support metric family config
  metrics_api.hh: Remove duplication of metric family metadata
graphcareful pushed a commit to graphcareful/seastar that referenced this pull request Mar 20, 2024
This series adds metrics family config.

Family config is a configuration that relates to all metrics with the same name but with different labels.

set_metrics_family_configs allows changing that configuration during run time. Specifically, change the label aggregation based on a metric name.

Using multiple independent labels on the same metric is a common practice. A known drawback is that this cartesian product makes it easy to discover in real-time that the number of metrics explodes.
While metrics relabel config allows dropping some of those metrics, it could be useful to report an aggregated metric.

The following is an example for setting the aggregate labels for the metric test_gauge_1 and all metrics matching the regex test_gauge1.*

```
std::vector<sm::metrics_family_config> fc(2);
fc[0].name = "test_gauge_1";
fc[0].aggregate_labels = { "lb" };
fc[1].regex_name = "test_gauge1.*";
fc[1].aggregate_labels = { "ll", "aa" };

sm::set_metrics_family_configs(fc);
```

An alternative to scylladb#1966

Closes scylladb#2121

* github.com:scylladb/seastar:
  tests/unit/metrics_test.cc: Test metrics family config
  metrics: Adds metric_family_config support
  core/relabel_config.hh: support metric family config
  metrics_api.hh: Remove duplication of metric family metadata
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants