stats: add support for histograms in prometheus export #5601
mattklein123 merged 14 commits into envoyproxy:master
Conversation
Yay!!! Thank you! @ramaraochavali, are you interested in doing a first-pass review on this, since you have worked with that code a lot?
ramaraochavali left a comment:
Did a first pass; generally looks good, with a few comments.
docs/root/operations/admin.rst (outdated)
|  Outputs /stats in `Prometheus <https://prometheus.io/docs/instrumenting/exposition_formats/>`_
|  v0.0.4 format. This can be used to integrate with a Prometheus server. Currently, only counters and
|  gauges are output. Histograms will be output in a future update.
|  v0.0.4 format. This can be used to integrate with a Prometheus server. Counters, gauges and

nit: I think you can delete the last sentence instead of specifically listing all of them.
|  HistogramStatisticsImpl::HistogramStatisticsImpl(const histogram_t* histogram_ptr)
|      : computed_quantiles_(supportedQuantiles().size(), 0.0) {
|  HistogramStatisticsImpl::HistogramStatisticsImpl(const histogram_t* histogram_ptr) {
|    computed_quantiles_ = std::vector<double>(supportedQuantiles().size(), 0.0);

Is there a reason why this has been moved out of the initializer list?
|  sample_sum_ = hist_approx_sum(histogram_ptr);
|  const std::vector<double>& supported_buckets_ref = supportedBuckets();
|  computed_buckets_ = std::vector<double>(supported_buckets_ref.size(), 0.0);

Can we also move this to the initializer list?
|  }
|  std::string HistogramStatisticsImpl::bucketSummary() const {
|    std::vector<std::string> bucketSummary;

nit: prefer the variable name bucket_summary.
|  @@ -41,6 +66,15 @@ void HistogramStatisticsImpl::refresh(const histogram_t* new_histogram_ptr) {
|    ASSERT(supportedQuantiles().size() == computed_quantiles_.size());

Can we add a similar ASSERT for the buckets as well?
test/server/http/admin_test.cc (outdated)
|  Buffer::OwnedImpl response;
|  EXPECT_EQ(2UL, PrometheusStatsFormatter::statsAsPrometheus(counters_, gauges_, response));
|  EXPECT_EQ(2UL,

Can you possibly add some histograms and verify them here?
/wait

/retest

🔨 rebuilding
|  sample_sum_ = hist_approx_sum(histogram_ptr);
|  const std::vector<double>& supported_buckets_ref = supportedBuckets();
|  for (size_t i = 0; i < supported_buckets_ref.size(); ++i) {

Prefer range iteration here instead of an index-based loop.
|  std::string HistogramStatisticsImpl::summary() const {
|  const std::vector<double>& HistogramStatisticsImpl::supportedBuckets() const {
|    static const std::vector<double> supported_buckets = {0.005, 0.01, 0.025, 0.05, 0.1, 0.25,

Just curious: are these standard bucket sizes generally used/exported by Prometheus?

I don't think this set of buckets is general enough. Consider a histogram for response size in bytes (or kilobytes), or for connection duration (in either seconds or milliseconds). Some not-so-uncommon request/response patterns will always land in the last bucket because they're way beyond the bounds.

In a subsequent PR I want to make this configurable via the bootstrap config, though I am open to suggestions if we want to expand this initial list for the time being.
I wanted to capture a range of normal request timings; we could possibly add a few more buckets tending towards higher request timings (e.g. 1 hour) to capture this data.
Prometheus default buckets (in seconds) are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, +Inf.
|  std::fill(computed_buckets_.begin(), computed_buckets_.end(), 0.0);
|  ASSERT(supportedBuckets().size() == computed_buckets_.size());
|  const std::vector<double>& supported_buckets_ref = supportedBuckets();
|  for (size_t i = 0; i < supported_buckets_ref.size(); ++i) {

Same here: prefer a range-based loop. Please change this elsewhere as well.
/wait
ggreenway left a comment:
This is great! Thanks for contributing this.
Please add a test case for the full text output of PrometheusStatsFormatter::statsAsPrometheus() for a sample histogram with some data.
include/envoy/stats/histogram.h (outdated)
|  virtual const std::vector<double>& computedQuantiles() const PURE;
|
|  /**
|   * Returns supported buckets.

Please add some documentation on what is in this vector. My assumption is that each value is the upper bound of a bucket, with 0 as the implicit lower bound of the first bucket.
include/envoy/stats/histogram.h (outdated)
|  virtual const std::vector<double>& supportedBuckets() const PURE;
|
|  /**
|   * Returns computed bucket values during the period.

Same as above: document exactly what this is. Also, I assume this vector is guaranteed to be the same length as supportedBuckets(); please document this.
|  computed_quantiles_.data());
|
|  sample_count_ = hist_sample_count(histogram_ptr);
|  sample_sum_ = hist_approx_sum(histogram_ptr);

What does the approximate sum mean here? If this is not exact, does it need to be documented in the interface?

This is an implementation detail of libcircllhist and the way it stores and sums counts of samples across bins. I'll put a comment on the sampleCount function to make it a bit clearer that it may not be a 100% exact value.
|  sample_count_ = hist_sample_count(histogram_ptr);
|  sample_sum_ = hist_approx_sum(histogram_ptr);
|
|  const std::vector<double>& supported_buckets_ref = supportedBuckets();

Add: ASSERT(supported_buckets_ref.size() == computed_buckets_.size())
|  sample_count_ = hist_sample_count(histogram_ptr);
|  sample_sum_ = hist_approx_sum(histogram_ptr);
|
|  const std::vector<double>& supported_buckets_ref = supportedBuckets();

Here and throughout: no need to end variable names in _ref.
source/server/http/admin.cc (outdated)
|  const std::vector<double>& supported_buckets_ref = stats.supportedBuckets();
|  for (size_t i = 0; i < supported_buckets_ref.size(); ++i) {
|    double bucket = supported_buckets_ref[i];
|    double value = stats.computedBuckets()[i];

Add a reference variable to computedBuckets() outside the loop, for symmetry with supported_buckets.
source/server/http/admin.cc (outdated)
|  for (size_t i = 0; i < supported_buckets_ref.size(); ++i) {
|    double bucket = supported_buckets_ref[i];
|    double value = stats.computedBuckets()[i];
|    if (histogram->tags().size() > 0) {

I think instead of this if/else here (and below), it would be easier to read if you had a variable const std::string hist_tags = histogram->tags().empty() ? EMPTY_STRING : (tags + ","); or used absl::StrJoin(), and then only had a single format string.
Thanks very much for the comments. I'm going to work on this more tomorrow to get it polished up and address the feedback, and also to add the Prometheus output test.

/wait
|  const std::vector<double>& HistogramStatisticsImpl::supportedBuckets() const {
|    static const std::vector<double> supported_buckets = {0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
|                                                          0.5, 1.0, 2.5, 5, 10};
|    static const std::vector<double> supported_buckets = {

I think this is a better default, but I'm still uncomfortable with hard-coding it.
I think in the short term we should allow configuring this via the bootstrap config, and I'd prefer to have that in this PR. @htuch, any opinion on whether the bootstrap config is the correct place for something like this?
I think medium or long term we should allow configuring the buckets either per-histogram, or for each type of histogram (time, bytes, etc.).

I'm in agreement. After changing these values and experimenting with them in a setup with a few hundred clusters, the number of metrics was very large, especially when you don't need the same level of range. In other scenarios, you might want the range.
I'll make this part of the bootstrap config in this PR as soon as I get a test in for the output. Ultimately we will want different buckets based on the type of histogram, but I'm not sure how we tag that at present.

Yeah, maybe in ?

One nice thing about Prometheus histograms is that the buckets are cumulative. This allows you to use metric_relabel_configs on the server side to drop buckets you don't need. Of course, there's still some resource use in generating and transferring the buckets over the wire, but the overhead is less than having to deal with storing and processing the extra buckets at query time.
/wait

@ggreenway I added the Prometheus output test. I'll finish up the plumbing for specifying the buckets via config within the next 24 hours, and then I believe it'll be ready for re-review. I'm perfectly happy if you wish to wait until then before you re-review.

/wait
…f the small nitpick comments Signed-off-by: Suhail Patel <me@suhailpatel.com>
operator Signed-off-by: Suhail Patel <me@suhailpatel.com>
Force-pushed a0d3df3 to 3015150
Apologies, I've been working on a few other things, but I definitely want to move this forward. We're using it actively now and it's been working a treat for getting some great visibility into Envoy via Prometheus. I've made a few more of the suggested changes. I'll open a separate PR on top of this branch for the
ggreenway left a comment:
I'm OK merging this without the config in order to keep things moving; someone can add that easily in a future PR. Do any @envoyproxy/maintainers disagree with that?
/retest

🔨 rebuilding
mattklein123 left a comment:
Agreed, let's ship and iterate. Thank you so much; this is an awesome addition that lots of people want.
@suhailpatel Hello, I am trying to use this feature now but I am confused; I tried to understand it from the code. Let's say I query bucket 100 for rq_time: x[100] gives me a number like 149. Does that mean the number of requests that took less than or equal to 100ms?
@youssefmamdouh Yes, that is correct: it gives you back the number of observations at or below that bucket value (so in your example there were 149 observations at or below 100). The Prometheus documentation is an excellent resource for understanding why this approach yields more accurate values, especially across multiple instances of Envoy: https://prometheus.io/docs/practices/histograms/
@suhailpatel Thank you for your work! (Just deployed it for testing.) Have you thought about labelling the histogram with response codes? Request counts are currently labelled with response codes, and I think this would be a great enhancement for the histogram too (often, long tails are caused by errors, not successful requests).
Signed-off-by: Suhail Patel <me@suhailpatel.com> Signed-off-by: Fred Douglas <fredlas@google.com>
Since the recent upgrade to Envoy 1.10 there is no need to use the statsd sink, as Envoy now exports those metrics to Prometheus by default. This was added in Envoy PR envoyproxy/envoy#5601, which is included in 1.10. This commit just removes the statsd references and updates the docs. With these changes, the statsd-enable contour flag can probably be removed in a future patch; will create an issue for that. Fixes: projectcontour#1035 Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Description: This PR adds support for native Prometheus histograms. I initially started exposing these as summaries, but decided to do full histograms because they allow for far better server-side aggregation, especially when you have lots of Envoy deployments.
Risk Level: Medium
Testing: Tests have been added for the bucket calculation, mostly following the tests we have now. I do want to add more tests on the Prometheus export side, but didn't want that to block the PR from being reviewed to ensure the approach is sane.
Docs Changes: Updated the admin docs to note that we now support Prometheus histograms.
Release Notes: Release notes have been updated with a line noting that we now export Prometheus histograms.
Fixes #1947