
stats: add support for histograms in prometheus export #5601

Merged
mattklein123 merged 14 commits into envoyproxy:master from suhailpatel:histogram-bucket-impl
Jan 31, 2019
Conversation

@suhailpatel
Contributor

@suhailpatel suhailpatel commented Jan 15, 2019

Description: This PR adds support for native Prometheus histograms. I initially started exposing these as summaries, but decided to do full histograms because they allow far better server-side aggregation, especially when you have lots of Envoy deployments.

Risk Level: Medium
Testing: Tests have been added for the bucket calculation, mostly following the existing tests. I do want to add more tests on the Prometheus export side, but didn't want that to block an initial review of the approach.
Docs Changes: Updated the admin docs to note that we now support Prometheus histograms.
Release Notes: Added a line noting that we now export Prometheus histograms.

Fixes #1947

@mattklein123
Member

Yay!!! Thank you! @ramaraochavali are you interested in doing a first pass review on this since you have worked with that code a lot?

@lizan lizan requested review from dio and ramaraochavali January 15, 2019 00:42
Contributor

@ramaraochavali ramaraochavali left a comment


Did a first pass; generally looks good, a few comments.

Outputs /stats in `Prometheus <https://prometheus.io/docs/instrumenting/exposition_formats/>`_
v0.0.4 format. This can be used to integrate with a Prometheus server. Currently, only counters and
gauges are output. Histograms will be output in a future update.
v0.0.4 format. This can be used to integrate with a Prometheus server. Counters, gauges and
Contributor

nit: I think you can delete the last sentence instead of specifically listing all of them.

HistogramStatisticsImpl::HistogramStatisticsImpl(const histogram_t* histogram_ptr)
: computed_quantiles_(supportedQuantiles().size(), 0.0) {
HistogramStatisticsImpl::HistogramStatisticsImpl(const histogram_t* histogram_ptr) {
computed_quantiles_ = std::vector<double>(supportedQuantiles().size(), 0.0);
Contributor

is there a reason why this has been moved out of the initializer list?

sample_sum_ = hist_approx_sum(histogram_ptr);

const std::vector<double>& supported_buckets_ref = supportedBuckets();
computed_buckets_ = std::vector<double>(supported_buckets_ref.size(), 0.0);
Contributor

can we also move this to the initializer list?

}

std::string HistogramStatisticsImpl::bucketSummary() const {
std::vector<std::string> bucketSummary;
Contributor

nit: prefer variable name bucket_summary

@@ -41,6 +66,15 @@ void HistogramStatisticsImpl::refresh(const histogram_t* new_histogram_ptr) {
ASSERT(supportedQuantiles().size() == computed_quantiles_.size());
Contributor

can we add a similar ASSERT for buckets as well?


Buffer::OwnedImpl response;
EXPECT_EQ(2UL, PrometheusStatsFormatter::statsAsPrometheus(counters_, gauges_, response));
EXPECT_EQ(2UL,
Contributor

can you possibly add some histograms and verify here?

@ramaraochavali
Contributor

/wait

@ramaraochavali
Contributor

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: release (failed build)

🐱

Caused by: a #5601 (comment) was created by @ramaraochavali.

see: more, trace.

sample_sum_ = hist_approx_sum(histogram_ptr);

const std::vector<double>& supported_buckets_ref = supportedBuckets();
for (size_t i = 0; i < supported_buckets_ref.size(); ++i) {
Contributor

prefer range-based iteration here instead of this plain index loop


std::string HistogramStatisticsImpl::summary() const {
const std::vector<double>& HistogramStatisticsImpl::supportedBuckets() const {
static const std::vector<double> supported_buckets = {0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
Contributor

just curious, are these standard bucket sizes used/exported by prometheus generally?

Member

I don't think this set of buckets is general enough. Consider a histogram for response size in bytes (or kilobytes), or for connection duration (in either seconds or milliseconds). Some not-so-uncommon request/response patterns will always land in the last bucket because they're way beyond the bounds.

Contributor Author

@suhailpatel suhailpatel Jan 15, 2019

In a subsequent PR I want to make this configurable via the bootstrap config. I'm open to suggestions if we want to expand this initial list in the meantime, though.

I wanted to capture a range of normal request timings; we could possibly add a few more buckets tending towards higher request timings (e.g. 1 hour) to capture that data.

Prometheus default buckets (in seconds) are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, +Inf

std::fill(computed_buckets_.begin(), computed_buckets_.end(), 0.0);
ASSERT(supportedBuckets().size() == computed_buckets_.size());
const std::vector<double>& supported_buckets_ref = supportedBuckets();
for (size_t i = 0; i < supported_buckets_ref.size(); ++i) {
Contributor

@ramaraochavali ramaraochavali Jan 15, 2019

same here: prefer a range-based loop; please change this elsewhere as well

@ramaraochavali
Contributor

/wait

Member

@ggreenway ggreenway left a comment

This is great! Thanks for contributing this.

Please add a test case for the full text output of PrometheusStatsFormatter::statsAsPrometheus() for a sample histogram with some data.

virtual const std::vector<double>& computedQuantiles() const PURE;

/**
* Returns supported buckets.
Member

Please add some documentation on what is in this vector. My assumption is that each value is the upper-bound of a bucket, with 0 as the implicit lower-bound of the first bucket.

virtual const std::vector<double>& supportedBuckets() const PURE;

/**
* Returns computed bucket values during the period.
Member

same as above: docs on exactly what this is. Also, I assume that this vector is guaranteed to be the same length as supportedBuckets(), please document this.

computed_quantiles_.data());

sample_count_ = hist_sample_count(histogram_ptr);
sample_sum_ = hist_approx_sum(histogram_ptr);
Member

What does the approximate sum mean here? If this is not exact, does it need to be documented in the interface?

Contributor Author

This is an implementation detail of libcircllhist and the way it stores and sums counts of samples across bins. I'll add a comment on the sampleCount function to make it a bit clearer that the value may not be 100% exact.

sample_count_ = hist_sample_count(histogram_ptr);
sample_sum_ = hist_approx_sum(histogram_ptr);

const std::vector<double>& supported_buckets_ref = supportedBuckets();
Member

Add: ASSERT(supported_buckets_ref.size() == computed_buckets_.size())



sample_count_ = hist_sample_count(histogram_ptr);
sample_sum_ = hist_approx_sum(histogram_ptr);

const std::vector<double>& supported_buckets_ref = supportedBuckets();
Member

Here and throughout: no need to end variable names in _ref.

const std::vector<double>& supported_buckets_ref = stats.supportedBuckets();
for (size_t i = 0; i < supported_buckets_ref.size(); ++i) {
double bucket = supported_buckets_ref[i];
double value = stats.computedBuckets()[i];
Member

Add a reference variable to computedBuckets() outside the loop for symmetry with supported_buckets

for (size_t i = 0; i < supported_buckets_ref.size(); ++i) {
double bucket = supported_buckets_ref[i];
double value = stats.computedBuckets()[i];
if (histogram->tags().size() > 0) {
Member

I think instead of this if/else here (and below), it would be easier to read if you had a variable const std::string hist_tags = histogram->tags().empty() ? EMPTY_STRING : (tags + ","); or use absl::StrJoin(), and then have only a single format string.

@suhailpatel
Contributor Author

Thanks very much for the comments. I'm going to work on this more tomorrow to polish it up, address the feedback, and add the Prometheus output test.

@ggreenway
Member

/wait

const std::vector<double>& HistogramStatisticsImpl::supportedBuckets() const {
static const std::vector<double> supported_buckets = {0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
0.5, 1.0, 2.5, 5, 10};
static const std::vector<double> supported_buckets = {
Member

I think this is a better default, but I'm still uncomfortable with hard coding it.

I think in the short term, we should allow configuring this via the bootstrap config. I'd prefer to have that in this PR. @htuch any opinion on if the bootstrap config is the correct place for something like this?

I think medium or long term, we should allow configuring the buckets either per-histogram, or for each type of histogram (time, bytes, etc).

Contributor Author

@suhailpatel suhailpatel Jan 16, 2019

I'm in agreement. After changing these values and experimenting in a setup with a few hundred clusters, the number of metrics was very large, especially when you don't need that level of range. In other scenarios, you might want it.

I'll make this part of the bootstrap config in this PR as soon as I get a test in for the output. Ultimately we will want different buckets based on the type of histogram, but I'm not sure if we tag that at present.

Member

Yeah, maybe in

message StatsConfig {
?


One nice thing about Prometheus histograms is the buckets are cumulative. This allows you to use metric_relabel_configs on the server side to drop buckets you don't need. Of course there's still some resource use in generating and transferring the buckets over the wire. But the overhead is less than having to deal with storing and processing the extra buckets at query time.

@ggreenway
Member

/wait

@suhailpatel
Contributor Author

suhailpatel commented Jan 16, 2019

@ggreenway I added the Prometheus output test. I'll finish up the plumbing for specifying the buckets via config within the next 24 hours, and then I believe it'll be ready for re-review. I'm perfectly happy if you wish to wait until then before re-reviewing.

@suhailpatel
Contributor Author

/wait

…f the small nitpick comments

Signed-off-by: Suhail Patel <me@suhailpatel.com>
operator

Signed-off-by: Suhail Patel <me@suhailpatel.com>
@suhailpatel suhailpatel force-pushed the histogram-bucket-impl branch from a0d3df3 to 3015150 Compare January 31, 2019 20:37
@suhailpatel
Contributor Author

Apologies, I've been working on a few other things. I definitely want to move this forward. We're using it actively now, and it's been working a treat for getting great visibility into Envoy via Prometheus.

I've made a few more of the suggested changes. I'll open a separate PR on top of this branch for the usedonly changes (as it also warrants a line in the version history).

Member

@ggreenway ggreenway left a comment

I'm ok merging this without the config in order to keep things moving. Someone can add that in easily in a future PR. Any @envoyproxy/maintainers disagree with that?

@ggreenway
Member

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: mac (failed build)

🐱

Caused by: a #5601 (comment) was created by @ggreenway.

see: more, trace.

@ggreenway
Member

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: mac (failed build)

🐱

Caused by: a #5601 (comment) was created by @ggreenway.

see: more, trace.

@ggreenway
Member

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: mac (failed build)

🐱

Caused by: a #5601 (comment) was created by @ggreenway.

see: more, trace.

Member

@mattklein123 mattklein123 left a comment

Agreed, let's ship and iterate. Thank you so much; this is an awesome addition that lots of people want.

@mattklein123 mattklein123 merged commit 6a58e5c into envoyproxy:master Jan 31, 2019
@youssefmamdouh

@suhailpatel hello, I am trying to use this feature now but I am confused; I tried to understand from the code. Let's say I query bucket 100 for rq_tim. x[100] gives me a number like 149. Does that mean the number of requests that took less than or equal to 100ms?

@suhailpatel
Contributor Author

@youssefmamdouh Yes, that is correct: it gives you back the number of observations at or below that bucket's upper bound (so in your example there were 149 observations at or below 100).

The Prometheus documentation is an excellent resource on understanding the rationale of why this approach yields more accurate values, especially over multiple instances of Envoy: https://prometheus.io/docs/practices/histograms/

@cube2222

cube2222 commented Feb 6, 2019

@suhailpatel Thank you for your work! (Just deployed for testing) Have you thought about labelling it with response codes?

Currently request counts are labelled with response codes, I think this would be a great enhancement for the histogram too (often, long tails are because of errors, not successful requests).

envoy_cluster_external_upstream_rq{envoy_response_code="200",envoy_cluster_name="local"} 547

fredlas pushed a commit to fredlas/envoy that referenced this pull request Mar 5, 2019
Signed-off-by: Suhail Patel <me@suhailpatel.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
rata added a commit to kinvolk-archives/contour that referenced this pull request May 13, 2019
Since the recent upgrade to envoy 1.10 there is no need to use the
statsd sink, as envoy exports those metrics by default to prometheus
now.

This was added on envoy PR:
envoyproxy/envoy#5601 that is included in 1.10.

This commit just removes the statsd references and updates the doc.

With these changes, the statsd-enable contour flag can probably be
removed on a future patch, will create an issue for that.

Fixes: projectcontour#1035
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
rata added a commit to kinvolk-archives/contour that referenced this pull request May 13, 2019
rata added a commit to kinvolk-archives/contour that referenced this pull request May 15, 2019
