High metric churn when using histogram #17

Open
raags opened this issue Sep 5, 2020 · 2 comments
Labels
question Further information is requested

Comments


raags commented Sep 5, 2020

We are using statsd_exporter, and I'm looking to plug VM histograms in there. One issue I see is the high metric churn, because each vmrange label is its own series. And we use a TTL, so there will be a large number of short-lived metrics. I suppose it depends on the cardinality of values for the target applications. For a webapp, values should ideally stay within 10s (which means 18 series). But as we start measuring parts of an application, e.g. external dependencies like Redis, MySQL, etc., this cardinality can explode.
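
For illustration, here is a minimal Go sketch (using the github.com/VictoriaMetrics/metrics client directly rather than statsd_exporter, so purely illustrative) of why the churn happens: each populated vmrange bucket is exported as its own series.

```go
// Minimal sketch: a VictoriaMetrics-style histogram exposes one
// *_bucket{vmrange="..."} series per non-empty bucket, plus _sum and _count.
package main

import (
	"os"

	"github.com/VictoriaMetrics/metrics"
)

func main() {
	h := metrics.GetOrCreateHistogram(`request_duration_seconds{path="/api"}`)
	for _, v := range []float64{0.02, 0.021, 0.3, 2.5, 9.7} {
		h.Update(v) // each distinct value range touched populates a vmrange bucket
	}
	// Every bucket that received at least one update shows up as a separate
	// time series on the exposition page.
	metrics.WritePrometheus(os.Stdout, false)
}
```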

So the question is: in practice, what is the cost of VM histograms vs. Prometheus histograms? Is it advisable to continue using Prometheus-style histograms, and use VM histograms for special cases?

Another approach I'm looking at to improve on the Prometheus histogram is to also publish summary metrics, i.e. every timer produces a histogram + a summary. So if a histogram caps at 10s, the summary can still provide quantiles at 1.0, 0.0, p95, etc.; albeit not across series, it's still useful. This seems less "costly" to me compared to moving to VM histograms.
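
A rough sketch of that idea in Go with the Prometheus client (metric names and summary objectives below are just illustrative; in our setup the equivalent mapping would live in statsd_exporter, not application code):

```go
// Sketch: every timer observation feeds both a capped Prometheus histogram
// and a summary, so per-series quantiles remain available above the top bucket.
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	reqHist = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "request_duration_seconds",
		Help:    "Request duration as a histogram capped at 10s.",
		Buckets: prometheus.DefBuckets, // default buckets end at 10s
	}, []string{"path"})

	reqSummary = prometheus.NewSummaryVec(prometheus.SummaryOpts{
		Name:       "request_duration_seconds_summary",
		Help:       "Per-series quantiles for the same timer.",
		Objectives: map[float64]float64{0.5: 0.05, 0.95: 0.01, 0.99: 0.001},
	}, []string{"path"})
)

func init() {
	prometheus.MustRegister(reqHist, reqSummary)
}

// observe records a single timer value into both metrics.
func observe(path string, seconds float64) {
	reqHist.WithLabelValues(path).Observe(seconds)
	reqSummary.WithLabelValues(path).Observe(seconds)
}
```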

@valyala valyala added the question Further information is requested label Sep 7, 2020

valyala commented Sep 7, 2020

And we use a TTL, so there will be a large number of short-lived metrics

Could you provide more details on this? Ideally, the same TTL should be applied to all the histogram buckets of each metric. This means that all the buckets for a metric should be removed after there have been no updates to the metric during the configured TTL. It is incorrect to remove particular buckets of a single histogram due to the TTL while leaving other buckets of the same histogram.
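
A hypothetical illustration of what I mean (not statsd_exporter's actual code): the TTL bookkeeping should key on the metric with the bucket label stripped, so all buckets of one histogram expire together.

```go
// Hypothetical sketch: the expiry key drops the bucket label (vmrange or le),
// so every bucket of one histogram shares the same "last updated" entry and
// is removed in one go when the TTL fires.
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// expiryKey strips the bucket label from a series' labels, so sibling buckets
// of the same histogram map to the same TTL entry.
func expiryKey(metric string, labels map[string]string) string {
	parts := []string{metric}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		if k == "vmrange" || k == "le" {
			continue // the bucket label is not part of the expiry key
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	lastUpdate := map[string]time.Time{}
	// Two buckets of the same histogram share one expiry entry.
	a := expiryKey("request_duration_seconds_bucket",
		map[string]string{"path": "/api", "vmrange": "1.000e+00...1.136e+00"})
	b := expiryKey("request_duration_seconds_bucket",
		map[string]string{"path": "/api", "vmrange": "8.799e+00...1.000e+01"})
	lastUpdate[a] = time.Now()
	lastUpdate[b] = time.Now()
	fmt.Println(a == b, len(lastUpdate)) // true 1
}
```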

So the question is: in practice, what is the cost of VM histograms vs. Prometheus histograms? Is it advisable to continue using Prometheus-style histograms, and use VM histograms for special cases?

The number of time series needed in practice for Prometheus-style (with the le label) and VictoriaMetrics-style (with the vmrange label) histograms should be comparable. VictoriaMetrics-style histograms have a better compression rate, so they usually need less disk space compared to Prometheus-style histograms.


raags commented Oct 29, 2020

@valyala I'm trying to find a solution to reduce the cardinality explosion that can happen due to histograms. For example, our biggest metrics are all histograms; on the order of 80% of all our metrics belong to them. I also noticed that we only need specific aggregates over specific labels, and we rarely do ad-hoc querying.

One solution, for example, is to write recording rules for the aggregates we need, such as p100 (max), p99, p95, p50, p0 (min), and average, over fixed predefined labels, e.g. path or customer_name. Now the graphs load much faster, and the load on VM is reduced.

However, there are 3 issues:

  1. The original metrics still remain and occupy resources
  2. Sometimes the cardinality is so high that even recording rules hit the -search.maxUniqueTimeseries limit
  3. Recording rules have to be retroactively added, or dynamically generated based on some metadata

One solution I was thinking of is a lightweight histogram aggregator that vmagent can divert all histograms to, which would calculate aggregates and write them back to VM. It could take a list of labels that it should aggregate across; a bonus would be if this could be gleaned from the metric itself, maybe via a meta label. This is similar to the Datadog distribution metric type, which I suspect they do for the same reason.
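
To make the idea concrete, here is a very rough sketch (all names are hypothetical; a real version would consume vmagent's stream and use proper histograms instead of raw samples):

```go
// Hypothetical sketch of the aggregator idea: keep one set of raw timer
// values per retained label set, and periodically emit quantile aggregates
// instead of the original high-cardinality series. A real version would
// remote-write the results back to VM.
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Observation is a single timer value with its full label set.
type Observation struct {
	Metric string
	Labels map[string]string
	Value  float64
}

type Aggregator struct {
	mu      sync.Mutex
	keep    []string             // labels to aggregate across, e.g. ["path"]
	samples map[string][]float64 // aggregation key -> raw values (simplified)
}

func NewAggregator(keep []string) *Aggregator {
	return &Aggregator{keep: keep, samples: map[string][]float64{}}
}

// key drops every label except the retained ones, collapsing high-cardinality
// labels such as instance or customer-specific ids.
func (a *Aggregator) key(o Observation) string {
	parts := []string{o.Metric}
	for _, l := range a.keep {
		parts = append(parts, l+"="+o.Labels[l])
	}
	return strings.Join(parts, ",")
}

func (a *Aggregator) Observe(o Observation) {
	a.mu.Lock()
	defer a.mu.Unlock()
	k := a.key(o)
	a.samples[k] = append(a.samples[k], o.Value)
}

// Flush computes p50/p95/p99/max per retained label set and resets the state.
func (a *Aggregator) Flush() {
	a.mu.Lock()
	defer a.mu.Unlock()
	for k, vals := range a.samples {
		sort.Float64s(vals)
		q := func(p float64) float64 { return vals[int(p*float64(len(vals)-1))] }
		fmt.Printf("%s p50=%.3f p95=%.3f p99=%.3f max=%.3f\n",
			k, q(0.50), q(0.95), q(0.99), vals[len(vals)-1])
	}
	a.samples = map[string][]float64{}
}

func main() {
	agg := NewAggregator([]string{"path"})
	for i := 0; i < 1000; i++ {
		agg.Observe(Observation{
			Metric: "request_duration_seconds",
			Labels: map[string]string{"path": "/api", "instance": fmt.Sprintf("pod-%d", i%50)},
			Value:  float64(i%100) / 100,
		})
	}
	agg.Flush()
}
```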
