High metric churn when using histogram #17
Could you provide more details on this? Ideally, the same TTL should be applied to all the histogram buckets of each metric. This means that all the buckets for a metric should be removed only after there have been no updates to that metric during the configured TTL. It is incorrect to remove particular buckets of a single histogram due to TTL while leaving the other buckets of the same histogram in place.
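For illustration, a single histogram from the `github.com/VictoriaMetrics/metrics` package already exposes several `vmrange` bucket series that only make sense together. A minimal sketch (metric name and values are made up, not taken from this issue):

```go
package main

import (
	"os"

	"github.com/VictoriaMetrics/metrics"
)

func main() {
	h := metrics.NewHistogram(`request_duration_seconds{path="/demo"}`)

	// Observations spread across a few orders of magnitude populate
	// several vmrange buckets for the same metric name + label set.
	for _, v := range []float64{0.004, 0.05, 0.3, 2.5} {
		h.Update(v)
	}

	// Prints one `..._bucket{...,vmrange="..."}` line per populated bucket,
	// plus the `_sum` and `_count` series. Expiring only some of these
	// lines via TTL would leave the histogram inconsistent.
	metrics.WritePrometheus(os.Stdout, false)
}
```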
The number of time series that are needed in practice for Prometheus-style (with …
@valyala I'm trying to find a solution to reduce the cardinality explosion that can happen due to histograms. For example, our biggest metrics are all histograms; on the order of 80% of all our series belong to them. I've also noticed that we only need specific aggregates over specific labels, and we rarely do ad-hoc querying. One solution, for example, is to write recording rules for the aggregates we need. However, there are 2 issues.
One solution I was thinking of is a lightweight histogram aggregator that vmagent can divert all histograms to, which would calculate the aggregates and write them back to VM (see the rough sketch below). It can take a list of …
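Purely to make the idea concrete, here is a very rough sketch. Everything in it is an assumption rather than a concrete proposal: it accepts raw timer observations over a hypothetical HTTP endpoint instead of pre-bucketed histogram series diverted by vmagent, and it exposes pre-aggregated summaries for scraping rather than remote-writing them back to VM.

```go
package main

import (
	"net/http"
	"strconv"

	"github.com/VictoriaMetrics/metrics"
)

func main() {
	// Hypothetical ingestion endpoint, e.g.
	// /observe?name=redis_call_seconds&value=0.012
	http.HandleFunc("/observe", func(w http.ResponseWriter, r *http.Request) {
		name := r.URL.Query().Get("name")
		v, err := strconv.ParseFloat(r.URL.Query().Get("value"), 64)
		if err != nil || name == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		// One summary per metric name: a handful of quantile series
		// instead of one series per vmrange/le bucket.
		metrics.GetOrCreateSummary(name).Update(v)
	})

	// vmagent (or VM itself) could scrape the pre-aggregated quantiles here.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		metrics.WritePrometheus(w, false)
	})

	http.ListenAndServe(":8429", nil)
}
```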
We are using statsd_exporter, and I'm looking to plug VM histograms in there. One issue I see is the high metric churn, because each `vmrange` label is its own series, and we use a TTL, so there will be a large number of short-lived series. I suppose it depends on what the cardinality of values is for the target applications. For a webapp, ideally they should be within 10s (which means 18 series), but as we start measuring parts of an application, e.g. external dependencies like Redis, MySQL, etc., this cardinality can explode.
So the question is: in practice, what is the cost of VM histograms vs. Prometheus histograms? Is it advisable to continue to use Prometheus-style histograms, and use VM histograms only for special cases?
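To get a feel for the numbers, here is a small experiment (assumptions: `github.com/VictoriaMetrics/metrics` histograms, synthetic latencies). VM histograms use up to 18 `vmrange` buckets per decimal range but only export the non-empty ones, so the series count depends on how widely the observed values spread; a fixed-bucket Prometheus histogram exports the same `le` series regardless.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/VictoriaMetrics/metrics"
)

func main() {
	h := metrics.NewHistogram(`dependency_call_seconds{dep="redis"}`)

	// Synthetic latencies from ~1ms up to ~10s, i.e. roughly 4 decimal ranges.
	for v := 0.001; v < 10; v *= 1.5 {
		h.Update(v)
	}

	var buf bytes.Buffer
	metrics.WritePrometheus(&buf, false)

	// Each populated bucket is its own series carrying a vmrange label;
	// spread the values widely enough and this exceeds a typical
	// fixed-bucket Prometheus histogram by a wide margin.
	fmt.Println("bucket series:", strings.Count(buf.String(), `vmrange="`))
}
```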
Another approach I'm looking at to improve on the Prometheus histogram is to also publish summary metrics, i.e. every `timer` produces a histogram + summary. So if a histogram caps at 10s, the summary can still provide quantiles at 1.0, 0.0, p95, etc.; albeit not aggregatable across series, it's still useful. This seems less "costly" to me compared to moving to VM histograms.