High metric churn when using histogram #17

Open
raags opened this issue Sep 5, 2020 · 2 comments
Labels
question Further information is requested

Comments


raags commented Sep 5, 2020

We are using statsd_exporter, and I'm looking to plug VM histograms in there. One issue I see is the high metric churn, because each vmrange label is its own series. And we use a TTL, so there will be a large number of short-lived metrics. I suppose it depends on the cardinality of values for the target applications. For a webapp, values should ideally stay within 10s (which means 18 series). But as we start measuring parts of an application, e.g. external dependencies like Redis, MySQL, etc., this cardinality can explode.
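
For illustration, here is a minimal Go sketch (using the github.com/VictoriaMetrics/metrics client directly rather than statsd_exporter, so purely illustrative) of why the churn happens: each populated vmrange bucket is exported as its own series.

```go
// Minimal sketch: a VictoriaMetrics-style histogram exposes one
// *_bucket{vmrange="..."} series per non-empty bucket, plus _sum and _count.
package main

import (
	"os"

	"github.com/VictoriaMetrics/metrics"
)

func main() {
	h := metrics.GetOrCreateHistogram(`request_duration_seconds{path="/api"}`)
	for _, v := range []float64{0.02, 0.021, 0.3, 2.5, 9.7} {
		h.Update(v) // each distinct value range touched populates a vmrange bucket
	}
	// Every bucket that received at least one update shows up as a separate
	// time series on the exposition page.
	metrics.WritePrometheus(os.Stdout, false)
}
```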

So the question is: in practice, what is the cost of VM histograms vs. Prometheus histograms? Is it advisable to continue using Prometheus-style histograms, and use VM histograms for special cases?

Another approach I'm looking at to improve on the Prometheus histogram is to also publish summary metrics, i.e. every timer produces a histogram + a summary. So if a histogram caps at 10s, the summary can still provide quantiles at 1.0, 0.0, p95, etc.; albeit not across series, it's still useful. This seems less "costly" to me compared to moving to VM histograms.
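
A rough sketch of that idea in Go with the Prometheus client (metric names and summary objectives below are just illustrative; in our setup the equivalent mapping would live in statsd_exporter, not application code):

```go
// Sketch: every timer observation feeds both a capped Prometheus histogram
// and a summary, so per-series quantiles remain available above the top bucket.
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	reqHist = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "request_duration_seconds",
		Help:    "Request duration as a histogram capped at 10s.",
		Buckets: prometheus.DefBuckets, // default buckets end at 10s
	}, []string{"path"})

	reqSummary = prometheus.NewSummaryVec(prometheus.SummaryOpts{
		Name:       "request_duration_seconds_summary",
		Help:       "Per-series quantiles for the same timer.",
		Objectives: map[float64]float64{0.5: 0.05, 0.95: 0.01, 0.99: 0.001},
	}, []string{"path"})
)

func init() {
	prometheus.MustRegister(reqHist, reqSummary)
}

// observe records a single timer value into both metrics.
func observe(path string, seconds float64) {
	reqHist.WithLabelValues(path).Observe(seconds)
	reqSummary.WithLabelValues(path).Observe(seconds)
}
```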

@valyala valyala added the question Further information is requested label Sep 7, 2020

valyala commented Sep 7, 2020

And we use a TTL, so there will be a large number of short-lived metrics

Could you provide more details on this? Ideally, the same TTL should be applied to all the histogram buckets of each metric. This means that all the buckets for a metric should be removed after there have been no updates to the metric during the configured TTL. It is incorrect to remove particular buckets of a single histogram due to the TTL while leaving other buckets of the same histogram.
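
A hypothetical illustration of what I mean (not statsd_exporter's actual code): the TTL bookkeeping should key on the metric with the bucket label stripped, so all buckets of one histogram expire together.

```go
// Hypothetical sketch: the expiry key drops the bucket label (vmrange or le),
// so every bucket of one histogram shares the same "last updated" entry and
// is removed in one go when the TTL fires.
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// expiryKey strips the bucket label from a series' labels, so sibling buckets
// of the same histogram map to the same TTL entry.
func expiryKey(metric string, labels map[string]string) string {
	parts := []string{metric}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		if k == "vmrange" || k == "le" {
			continue // the bucket label is not part of the expiry key
		}
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	lastUpdate := map[string]time.Time{}
	// Two buckets of the same histogram share one expiry entry.
	a := expiryKey("request_duration_seconds_bucket",
		map[string]string{"path": "/api", "vmrange": "1.000e+00...1.136e+00"})
	b := expiryKey("request_duration_seconds_bucket",
		map[string]string{"path": "/api", "vmrange": "8.799e+00...1.000e+01"})
	lastUpdate[a] = time.Now()
	lastUpdate[b] = time.Now()
	fmt.Println(a == b, len(lastUpdate)) // true 1
}
```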

So the question is: in practice, what is the cost of VM histograms vs. Prometheus histograms? Is it advisable to continue using Prometheus-style histograms, and use VM histograms for special cases?

The number of time series needed in practice for Prometheus-style (with the le label) and VictoriaMetrics-style (with the vmrange label) histograms should be comparable. VictoriaMetrics-style histograms have a better compression rate, so they usually need less disk space compared to Prometheus-style histograms.


raags commented Oct 29, 2020

@valyala I'm trying to find a solution to reduce the cardinality explosion that can happen due to histograms. For example, our biggest metrics are all histograms; on the order of 80% of all our metrics belong to them. I also noticed that we only need specific aggregates over specific labels, and we rarely do ad-hoc querying.

One solution, for example, is to write recording rules for the aggregates we need, such as p100 (max), p99, p95, p50, p0 (min), and average, over fixed predefined labels, e.g. path or customer_name. Now the graphs load much faster, and the load on VM is reduced.

However, there are 3 issues:

  1. The original metrics still remain and occupy resources
  2. Sometimes the cardinality is so high that even recording rules hit the -search.maxUniqueTimeseries limit
  3. Recording rules have to be retroactively added, or dynamically generated based on some metadata

One solution I was thinking of is a lightweight histogram aggregator that vmagent can divert all histograms to, which would calculate aggregates and write them back to VM. It could take a list of labels that it should aggregate across; a bonus would be if this could be gleaned from the metric itself, maybe via a meta label. This is similar to the Datadog distribution metric type, which I suspect they do for the same reason.
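
To make the idea concrete, here is a very rough sketch (all names are hypothetical; a real version would consume vmagent's stream and use proper histograms instead of raw samples):

```go
// Hypothetical sketch of the aggregator idea: keep one set of raw timer
// values per retained label set, and periodically emit quantile aggregates
// instead of the original high-cardinality series. A real version would
// remote-write the results back to VM.
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Observation is a single timer value with its full label set.
type Observation struct {
	Metric string
	Labels map[string]string
	Value  float64
}

type Aggregator struct {
	mu      sync.Mutex
	keep    []string             // labels to aggregate across, e.g. ["path"]
	samples map[string][]float64 // aggregation key -> raw values (simplified)
}

func NewAggregator(keep []string) *Aggregator {
	return &Aggregator{keep: keep, samples: map[string][]float64{}}
}

// key drops every label except the retained ones, collapsing high-cardinality
// labels such as instance or customer-specific ids.
func (a *Aggregator) key(o Observation) string {
	parts := []string{o.Metric}
	for _, l := range a.keep {
		parts = append(parts, l+"="+o.Labels[l])
	}
	return strings.Join(parts, ",")
}

func (a *Aggregator) Observe(o Observation) {
	a.mu.Lock()
	defer a.mu.Unlock()
	k := a.key(o)
	a.samples[k] = append(a.samples[k], o.Value)
}

// Flush computes p50/p95/p99/max per retained label set and resets the state.
func (a *Aggregator) Flush() {
	a.mu.Lock()
	defer a.mu.Unlock()
	for k, vals := range a.samples {
		sort.Float64s(vals)
		q := func(p float64) float64 { return vals[int(p*float64(len(vals)-1))] }
		fmt.Printf("%s p50=%.3f p95=%.3f p99=%.3f max=%.3f\n",
			k, q(0.50), q(0.95), q(0.99), vals[len(vals)-1])
	}
	a.samples = map[string][]float64{}
}

func main() {
	agg := NewAggregator([]string{"path"})
	for i := 0; i < 1000; i++ {
		agg.Observe(Observation{
			Metric: "request_duration_seconds",
			Labels: map[string]string{"path": "/api", "instance": fmt.Sprintf("pod-%d", i%50)},
			Value:  float64(i%100) / 100,
		})
	}
	agg.Flush()
}
```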
