
Add stats to record latency metrics for each response code in NH #381

Merged
mum4k merged 49 commits into envoyproxy:master from qqustc:pr3
Jul 23, 2020

Conversation

@qqustc
Contributor

@qqustc qqustc commented Jun 22, 2020

Add the following latency histograms in benchmark_client_impl:

  • benchmark_http_client.latency_1xx | SinkableHdrStatistic | Latency (in Nanosecond) histogram of request with code 1xx
  • benchmark_http_client.latency_2xx | SinkableHdrStatistic | Latency (in Nanosecond) histogram of request with code 2xx
  • benchmark_http_client.latency_3xx | SinkableHdrStatistic | Latency (in Nanosecond) histogram of request with code 3xx
  • benchmark_http_client.latency_4xx | SinkableHdrStatistic | Latency (in Nanosecond) histogram of request with code 4xx
  • benchmark_http_client.latency_5xx | SinkableHdrStatistic | Latency (in Nanosecond) histogram of request with code 5xx
  • benchmark_http_client.latency_xxx | SinkableHdrStatistic | Latency (in Nanosecond) histogram of request with code <100 or >=600

Linked Issues: #344

Testing: unit tests
Docs Changes: Add docs/root/statistics.md

Signed-off-by: Qin Qin qqin@google.com

…metrics in NH (#1)

* Create statistics.md

* Update benchmark_client_impl.cc

* Update benchmark_client_impl.h

* Update stream_decoder.cc

* Update stream_decoder.h

* Update BUILD

* Update benchmark_http_client_test.cc

* Update test_integration_basics.py

* Update stream_decoder_test.cc

* local commit 1
Signed-off-by: qqustc <qqin@google.com>

Commit from master
Signed-off-by: qqustc <qqustc@gmail.com>
@qqustc qqustc marked this pull request as ready for review June 22, 2020 23:14
@qqustc qqustc closed this Jun 23, 2020
@qqustc qqustc reopened this Jun 23, 2020
@qqustc qqustc changed the title Add counter/histogram to record total number of requests and latency metrics in NH [Draft]Add counter/histogram to record total number of requests and latency metrics in NH Jun 23, 2020
Signed-off-by: qqustc <qqin@google.com>

Signed-off-by: qqustc <qqustc@gmail.com>
Signed-off-by: qqustc <qqin@google.com>
@oschaaf
Member

oschaaf commented Jun 24, 2020

Could the abrupt socket termination complaints in the longer running ci tasks be associated to a subprocess being killed because of the closing/re-opening of the PR here? /retest

@qqustc
Contributor Author

qqustc commented Jun 24, 2020

Could the abrupt socket termination complaints in the longer running ci tasks be associated to a subprocess being killed because of the closing/re-opening of the PR here?

I don't think so. When I closed and reopened the PR, there were no CI tasks running. The CI tasks only started to run after I made the last commit.

@oschaaf
Member

oschaaf commented Jun 24, 2020

Well, apparently /retest doesn't work for some reason; I'll respin the jobs manually then.
The CI tasks failed after a couple of minutes, which I have never seen before.

@oschaaf
Member

oschaaf commented Jun 24, 2020

(In instances where they flake as observed in other cases, they have been running much longer)

@oschaaf
Member

oschaaf commented Jun 24, 2020

CircleCI doesn't allow me to restart the jobs in its UI; those options are grayed out.

@qqustc qqustc requested a review from oschaaf June 24, 2020 19:20
@qqustc qqustc changed the title [Draft]Add counter/histogram to record total number of requests and latency metrics in NH Add counter/histogram to record total number of requests and latency metrics in NH Jun 24, 2020
@qqustc qqustc added P1 waiting-for-review A PR waiting for a review. labels Jun 24, 2020
Member

@oschaaf oschaaf left a comment


Thanks, posting a round of comments and requesting feedback on an idea in there.

COUNTER(pool_overflow) \
COUNTER(pool_connection_failure)
COUNTER(pool_connection_failure) \
COUNTER(total_req_sent) \
Member

How is this different from upstream_rq_total tracked by Envoy?

Contributor Author

@qqustc qqustc Jun 25, 2020

It seems upstream_rq_total is per cluster (I'm not sure what a cluster means in Nighthawk) while total_req_sent here is per worker.
If there is no difference between these two counters to the user, we should reuse upstream_rq_total.

Member

We set up a single cluster per worker today, so I think effectively the counters are tracking the same thing.

Contributor Author

Thanks Otto for the explanation! Decided not to add total_req_sent since it is equivalent to upstream_rq_total.

if (response_code > 199 && response_code <= 299) {
benchmark_client_stats_.latency_on_success_req_.recordValue(latency_us);
} else {
benchmark_client_stats_.latency_on_error_req_.recordValue(latency_us);
Member

I wonder if this is a one size fits all. This might often suffice, but could there be a case where one would like to control the status code aggregation more? E.g. bucketize 4xx and 5xx? One might be a missing file, handled by the webserver itself, whereas the other could be an application crashing out, both would have their own latency characteristics.

Contributor Author

@qqustc qqustc Jun 25, 2020

Agree that we should add separate latency metrics for each code if it provides more value.
Is it better to have all (latency_1xx, latency_2xx, latency_3xx, latency_4xx, latency_5xx, latency_xxx) for uniformity rather than just adding (latency_4xx, latency_5xx)?

The previous reasoning was to wait until Envoy supports metrics with key-value labels (so a single latency metric could carry the response code as a label) and, for now, to add only two in order to save memory (if that is a concern).

Member

That would be consistent with the counters we track too (and actually make them sort of redundant?).

Obviously out of scope here, but still, I'm kind of wondering if we could/should somehow attempt to generalise the concept of grouping metrics based on user-provided configuration. If we could do that today, the initial grouping proposed here would be a nice conservative default. So, thinking about it, maybe it is providing the means to change the default that I was concerned about here, and not so much the specific default. Maybe we shouldn't be discussing that here in this PR, but elsewhere :-) I'll defer to you for the specific default we choose here.

Contributor Author

Thanks Otto. Added latency statistics latency_1xx, latency_2xx, latency_3xx, latency_4xx, latency_5xx, latency_xxx.

COUNTER(pool_connection_failure)
COUNTER(pool_connection_failure) \
COUNTER(total_req_sent) \
HISTOGRAM(latency_on_success_req, Microseconds) \
Member

Maybe include http_status in the name, as that's what is being used to define success/error?

Contributor Author

Obsolete.

time_source_.monotonicTime() - request_start_)
.count();
decoder_completion_callback_.exportLatency(stream_info_.response_code_.value(), latency_us);
}
Member

Would a log line add value here in an else clause? That might need to guard against high frequency logging, but there might be something special going on when we'd hit that?

Contributor Author

Good point. Will add a sparse log.

Collaborator

@mum4k mum4k left a comment

Partial review (documentation only for now).

-----| ----- | ----------------
total_req_sent | Counter | Total number of requests sent from Nighthawk
http_xxx | Counter | Total number of response with code xxx
stream_resets | Counter | Total number of sream reset
Collaborator

typo: stream

Contributor Author

Done.

Name | Type | Description
-----| ----- | ----------------
total_req_sent | Counter | Total number of requests sent from Nighthawk
http_xxx | Counter | Total number of response with code xxx
Collaborator

Would it be more useful to list the counters explicitly? How many are we planning to export, is it too many?

If we do list them explicitly, users can use this as a reference when looking for the complete list of metrics.

Contributor Author

Updated to list the counters explicitly. The plan is to convert all counters with different codes into a single streamz in customized Sinks (similar to what is in Monarch Sink).

# Nighthawk Statistics

## Background
Currently Nighthawk only outputs metrics at the end of a test run. There are no metrics streamed during a test run. In order to collect all the useful Nighthawk metrics, we plan to instrument Nighthawk with the functionality to stream its metrics.
Collaborator

Not sure if the future tense is a good choice here. We will end up implementing this at some point and this document will still talk about it as something we only plan to work on.

How would you feel about writing this document as a documentation for metrics that are already supported (as in already done) and if you want we could add a TODO or a note that this is work in progress. We can remove that note once done.

Contributor Author

Thanks Jakub for the suggestion! Agreed that it is better to write this document as a documentation for metrics that are already supported.

Updated the doc to document metrics that are already supported and added a note saying The work to stream its metrics is in progress.

In Envoy, the stat [Store](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/store.h#L29) is a singleton and provides a simple interface by which the rest of the code can obtain handles to scopes, counters, gauges, and histograms. Envoy counters and gauges are periodically (configured at ~5 sec interval) flushed to the sinks. Note that currently histogram values are sent directly to the sinks. A stat [Sink](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/sink.h#L48) is an interface that takes generic stat data and translates it into a backend-specific wire format. Currently Envoy supports the TCP and UDP [statsd](https://github.com/b/statsd_spec) protocol (implemented in [statsd.h](https://github.com/envoyproxy/envoy/blob/master/source/extensions/stat_sinks/common/statsd/statsd.h)). Users can create their own Sink subclass to translate Envoy metrics into backend-specific format.

Envoy metrics can be defined using a macro, e.g.
```
Collaborator

Do we want to add syntax highlighting here, i.e. "cc" instead of just ""?

Contributor Author

Thanks @mum4k for the note. This is really useful to know. Done.



## Metrics Export in Nighthawk
Currently a single Nighthawk can run with multiple workers. In the future, Nighthawk will be extended to be able to run multiple instances together. Since each Nighthawk worker sends requests independently, we decide to export per worker level metrics since it provides several advantages over global level metrics (aggregated across all workers). Notice that *there are existing Nighthawk metrics already defined at per worker level* ([example](https://github.com/envoyproxy/nighthawk/blob/bc72a52efdc489beaa0844b34f136e03394bd355/source/client/benchmark_client_impl.cc#L61)).
Collaborator

possible typo: "we decided".

Contributor Author

Done

@@ -0,0 +1,63 @@
# Nighthawk Statistics
Collaborator

Can we format this document so that each line is at most 80 characters?

Contributor Author

Done.

@mum4k mum4k added waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. and removed waiting-for-review A PR waiting for a review. labels Jun 25, 2020
qqustc and others added 4 commits June 30, 2020 17:04
Signed-off-by: qqustc <qqin@google.com>
…Statistic

Signed-off-by: Qin Qin <qqin@google.com>

Signed-off-by: qqustc <qqustc@gmail.com>
Signed-off-by: Qin Qin <qqin@google.com>
Signed-off-by: Qin Qin <qqin@google.com>
Signed-off-by: Qin Qin <qqin@google.com>
Collaborator

@mum4k mum4k left a comment

I note this PR is adding a Submodule called "nighthawk". Is that intentional and can you help me understand what it is for?

};

// For histogram metrics, Nighthawk Statistic is used instead of Envoy Histogram.
struct BenchmarkClientStatistic {
Collaborator

What is the relationship between struct BenchmarkClientStats and struct BenchmarkClientStatistic? They have very similar names, yet don't look directly related.

Can we improve the readability by either or both:

  • choosing better name for the new struct, one that would clearly distinguish it from the existing one.
  • adding a comment here and above struct BenchmarkClientStats explaining the difference.

Contributor Author

Added a comment explaining the difference.

For counter metrics, Nighthawk use Envoy Counter directly. For histogram metrics, Nighthawk use its own Statistic instead of Envoy Histogram. Here BenchmarkClientStats contains only counters while BenchmarkClientStatistic contains only histograms.

Contributor

I think the comment helps here, but not elsewhere in the code where this exists. Could we rename these BenchmarkClientCounters and BenchmarkClientHistograms? Or something else that distinguishes their purposes in the names themselves.

Contributor Author

Thanks Nate. Renamed BenchmarkClientStats to BenchmarkClientCounters to avoid confusion (while keeping BenchmarkClientStatistic unchanged, since some of its members are StreamingStatistic, which is not exactly a histogram).

Contributor

Makes sense to me. Thanks for the effort here!

StatisticPtr response_statistic_;
StatisticPtr response_header_size_statistic_;
StatisticPtr response_body_size_statistic_;
StatisticPtr latency_1xx_statistic_;
Collaborator

Why do we list these here individually? Can we reuse the struct BenchmarkClientStatistic defined above?

Contributor Author

Yes, changed to reuse BenchmarkClientStatistic struct here.

Envoy::Upstream::ClusterManagerPtr& cluster_manager,
Envoy::Tracing::HttpTracerSharedPtr& http_tracer, absl::string_view cluster_name,
RequestSource& request_generator) const {
const int sink_stat_prefix, RequestSource& request_generator) const {
Collaborator

We probably don't need to mark an int argument const, since int being a basic type is immutable (passed by value).

Contributor Author

Done.

// NullStatistic).
// TODO(#292): Create options and have the StatisticFactory consider those when instantiating
// statistics.
BenchmarkClientStatistic statistic;
Collaborator

Would it make sense to define a constructor for this struct right next to it and just call the constructor here?

Contributor Author

Thanks for the suggestion. Added a constructor for the struct. Yes, this makes the code simpler.

Envoy::Upstream::ClusterManagerPtr& cluster_manager,
Envoy::Tracing::HttpTracerSharedPtr& http_tracer,
absl::string_view cluster_name,
absl::string_view cluster_name, const int sink_stat_prefix,
Collaborator

Comment about unnecessary const qualifier next to an int argument applies here as well.

Contributor Author

Done.

{{":scheme", "http"}, {":method", "GET"}, {":path", "/"}, {":host", "localhost"}}));
return std::make_unique<RequestImpl>(header);
};
statistic_.connect_statistic = std::make_unique<StreamingStatistic>();
Collaborator

Comment about having a constructor next to the struct itself applies here too. Assuming that the differences between the two call sites can be parameterized.

Contributor Author

Done.

EXPECT_EQ(10, client_->statistics()["benchmark_http_client.latency_2xx"]->count());
}

TEST_F(BenchmarkClientHttpTest, ExportSuccessLatency) {
Collaborator

I note the test coverage for the latency metrics.

How are we testing population and propagation of the other fields on struct BenchmarkClientStatistic?

Contributor Author

The tests on connect_statistic and response_statistic are already in the EnableLatencyMeasurement test case. Added tests for response_header_size_statistic and response_body_size_statistic there too.

@mum4k mum4k added waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. and removed waiting-for-review A PR waiting for a review. labels Jul 17, 2020
@qqustc qqustc changed the title Add counter/histogram to record total number of requests and latency metrics in NH Add stats to record latency metrics for each response code in NH Jul 20, 2020
qqustc added 3 commits July 20, 2020 14:30
Signed-off-by: Qin Qin <qqin@google.com>
Signed-off-by: Qin Qin <qqin@google.com>
Signed-off-by: Qin Qin <qqin@google.com>
@qqustc qqustc added waiting-for-review A PR waiting for a review. and removed waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. labels Jul 21, 2020
@qqustc
Contributor Author

qqustc commented Jul 21, 2020

I note this PR is adding a Submodule called "nighthawk". Is that intentional and can you help me understand what it is for?

Thanks Jakub. This was not intentional and it has been deleted.

@qqustc qqustc requested a review from mum4k July 21, 2020 14:00
@mum4k mum4k requested a review from dubious90 July 21, 2020 19:41
@mum4k
Collaborator

mum4k commented Jul 21, 2020

Thanks for implementing the changes @qqustc.

@dubious90 please review and assign back to me once done.
@oschaaf please take another look as there were significant changes after your review.

@mum4k mum4k removed their assignment Jul 21, 2020
Member

@oschaaf oschaaf left a comment

It looks like we're now always instantiating SinkableHdrStatistic instances from the factory, but there's no sink config involved. I'm curious: how will that behave? Do we need to consider adding options in this PR to instantiate SinkableXXXStatistic or just XXXStatistic before merging this? E.g. when an option like --sink <address:port> is given?

@qqustc
Contributor Author

qqustc commented Jul 21, 2020

It looks like we're now always instantiating SinkableHdrStatistic instances from the factory, but there's no sink config involved. I'm curious: how will that behave? Do we need to consider adding options in this PR to instantiate SinkableXXXStatistic or just XXXStatistic before merging this? E.g. when an option like --sink <address:port> is given?

Thanks Otto. When there is no sink configured, SinkableHdrStatistic behaves the same as HdrStatistic (deliverHistogramToSinks in SinkableHdrStatistic::recordValue does nothing). So we don't need to add an option in this PR to decide whether to instantiate SinkableXXXStatistic or just XXXStatistic.

@oschaaf
Member

oschaaf commented Jul 21, 2020

Thanks Otto. When there is no sink configured, SinkableHdrStatistic just behave the same as HdrStatistic (deliverHistogramToSinks in SinkableHdrStatistic::recordValue does nothing). So we don't need to add option in this PR to decide whether instantiate SinkableXXXStatistic or just XXXStatistic.

Great, thanks for the explanation.

oschaaf
oschaaf previously approved these changes Jul 21, 2020
benchmark_http_client.latency_5xx | HdrStatistic | Latency (in Nanosecond) histogram of request with code 5xx
benchmark_http_client.latency_xxx | HdrStatistic | Latency (in Nanosecond) histogram of request with code <100 or >=600
benchmark_http_client.queue_to_connect | HdrStatistic | Histogram of request connection time (in Nanosecond)
benchmark_http_client.request_to_response | HdrStatistic | Latency (in Nanosecond) histogram of all requests (include requests with stream reset or pool failure)
Contributor

"histogram of all requests" feels like the wrong clarification here. All of the histograms deal with all of the requests, right? Would this be better as "histogram of full request time"?

Contributor Author

All the other latency_*** histograms only collect values for requests where a response with a code was successfully returned, while the request_to_response histogram also includes requests with stream resets or pool failures.
Updated to Latency (in Nanosecond) histogram include requests with stream reset or pool failure

Contributor

Understood. Thanks for the clarification.

benchmark_http_client.latency_xxx | HdrStatistic | Latency (in Nanosecond) histogram of request with code <100 or >=600
benchmark_http_client.queue_to_connect | HdrStatistic | Histogram of request connection time (in Nanosecond)
benchmark_http_client.request_to_response | HdrStatistic | Latency (in Nanosecond) histogram of all requests (include requests with stream reset or pool failure)
benchmark_http_client.response_header_size | StreamingStatistic | Statistic of response header size (min, max, mean, pstdev values in bytes)
Contributor

It's not clear to me reading this what these statistics are going to look like. If HdrStatistic is a histogram, what is StreamingStatistic? If it's also a histogram, how does it differ from HdrStatistic?

Contributor Author

Both StreamingStatistic and HdrStatistic are implementations of the NH Statistic. The difference is that HdrStatistic is a "real" histogram (it stores the distribution and provides detailed percentile values) while StreamingStatistic is not exactly a histogram (it only provides min, max, mean, and pstdev values, with no percentiles).

As shown in this table, different NH metrics choose different implementations of NH Statistic.

In Envoy, the stat
[Store](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/store.h#L29)
is a singleton and provides a simple interface by which the rest of the code can
obtain handles to scopes, counters, gauges, and histograms. Envoy counters and
Contributor

it sticks out that this is undefined, while counters, gauges, and histograms are defined above.

Contributor Author

Store is not a type of metric, so I decided not to list its definition above with the other metrics. Since it is a (common) Envoy concept and its definition is not closely relevant to the NH Statistics doc, only a link to the code definition is provided here.

Contributor

Sorry I was unclear here. I was actually referring to scopes. I realize that scopes are probably not metrics themselves, but I'm not sure what they are in this context, and unlike store, we don't have a link to envoy documentation.

Contributor Author

Added a code link to scopes too.

Users can create their own Sink subclass to translate Envoy metrics into
backend-specific format.

Envoy metrics can be defined using a macro, e.g.
Contributor

This example is helpful, but also opens up other questions. We're defining metrics here, but they're just names. How are they calculated? Can you add a second code snippet below it, explaining how to collect stats?

Contributor Author

Done.

Contributor

looks great. thank you

Currently Envoy metrics don't support key-value map. As a result, for metrics to
be broken down by certain dimensions, we need to define a separate metric for
each dimension. For example, currently Nighthawk defines separate
[counters](https://github.com/envoyproxy/nighthawk/blob/master/source/client/benchmark_client_impl.h#L35-L40)
Contributor

nit: I think your intention would be clearer if you changed what you're bracketing:
currently Nighthawk defines [separate counters]

Contributor Author

Done.

namespace Nighthawk {
namespace Client {

BenchmarkClientStatistic::BenchmarkClientStatistic(BenchmarkClientStatistic& statistic)
Contributor

I'm probably misunderstanding, but this appears to be a copy constructor that moves all of the underlying fields, which seems unideal.

Contributor Author

Thanks Nate for noticing this. Changed it to a move constructor.
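For readers following along, here is a minimal sketch of the difference (the type and field names are illustrative, not Nighthawk's actual `BenchmarkClientStatistic`): a move constructor takes an rvalue reference, so the transfer of ownership is explicit at the call site, whereas a copy-constructor signature that moves its fields silently invalidates the source.

```cpp
#include <memory>
#include <utility>

struct StatisticSketch {
  explicit StatisticSketch(std::unique_ptr<int> histogram)
      : histogram_(std::move(histogram)) {}

  // Move constructor: callers must write std::move(source), making the
  // ownership transfer visible. A "copy constructor" taking a non-const
  // lvalue reference that moved its fields would empty the source with no
  // hint at the call site.
  StatisticSketch(StatisticSketch&& other) noexcept
      : histogram_(std::move(other.histogram_)) {}

  // Copying is disallowed since the histogram is uniquely owned.
  StatisticSketch(const StatisticSketch&) = delete;

  std::unique_ptr<int> histogram_;
};
```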


void BenchmarkClientHttpImpl::exportLatency(const uint32_t response_code,
const uint64_t latency_ns) {
if (response_code > 99 && response_code <= 199) {
Contributor

optional: it seems very odd to me to notate this expression this way. It's more conventional to do:
response_code >= 100 && response_code < 200

Contributor Author

ah, I was following the code here.

if (status > 99 && status <= 199) {
benchmark_client_stats_.http_1xx_.inc();
} else if (status > 199 && status <= 299) {
benchmark_client_stats_.http_2xx_.inc();
} else if (status > 299 && status <= 399) {
benchmark_client_stats_.http_3xx_.inc();
} else if (status > 399 && status <= 499) {
benchmark_client_stats_.http_4xx_.inc();
} else if (status > 499 && status <= 599) {
benchmark_client_stats_.http_5xx_.inc();
} else {
benchmark_client_stats_.http_xxx_.inc();
}
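Applied to the whole chain, the reviewer's suggested notation might look like this standalone sketch (the function name and string return value are illustrative; the real code increments counters on the stats struct instead):

```cpp
#include <cstdint>
#include <string>

// Maps an HTTP response code to its x-hundred stat bucket. Codes below 100
// or at 600 and above fall into the catch-all "xxx" bucket.
std::string responseCodeBucket(uint32_t response_code) {
  if (response_code >= 100 && response_code < 200) {
    return "1xx";
  } else if (response_code >= 200 && response_code < 300) {
    return "2xx";
  } else if (response_code >= 300 && response_code < 400) {
    return "3xx";
  } else if (response_code >= 400 && response_code < 500) {
    return "4xx";
  } else if (response_code >= 500 && response_code < 600) {
    return "5xx";
  }
  return "xxx";
}
```

The `>= 100 && < 200` form reads directly as "in the 1xx range", which is why it is the more conventional spelling of the same predicate.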

COUNTER(pool_overflow) \
COUNTER(pool_connection_failure)

// For counter metrics, Nighthawk use Envoy Counter directly. For histogram metrics, Nighthawk use
Contributor

nit: grammar - Nighthawk uses

Contributor Author

Done.

};

// For histogram metrics, Nighthawk Statistic is used instead of Envoy Histogram.
struct BenchmarkClientStatistic {
Contributor

I think the comment helps here, but not elsewhere in the code where this exists. Could we rename these BenchmarkClientCounters and BenchmarkClientHistograms? Or something else that distinguishes their purposes in the names themselves.

Envoy::Upstream::ClusterManagerPtr& cluster_manager,
Envoy::Tracing::HttpTracerSharedPtr& http_tracer, absl::string_view cluster_name,
RequestSource& request_generator) const {
int sink_stat_prefix, RequestSource& request_generator) const {
Contributor

it's possible this is also true elsewhere in this PR, but is it desirable for us to only allow an int as the sink_stat_prefix? In the future, isn't it possible that we'd want to use a prefix that isn't just the worker number?

If not, then this really isn't just a prefix. It has the specific meaning of being a worker number.

Contributor Author

Discussed offline. Decided to change sink_stat_prefix to worker_id to make it explicit that this will be used to populate the worker_id field in SinkableHdrStatistic.

std::make_unique<SinkableHdrStatistic>(scope, worker_id),
std::make_unique<SinkableHdrStatistic>(scope, worker_id),
std::make_unique<SinkableHdrStatistic>(scope, worker_id),
std::make_unique<SinkableHdrStatistic>(scope, worker_id),
std::make_unique<SinkableHdrStatistic>(scope, worker_id),
std::make_unique<SinkableHdrStatistic>(scope, worker_id));
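Conceptually, passing `worker_id` into every statistic lets each worker's histograms be told apart when they are flushed to a sink. A minimal sketch of the idea (the class name and the `<worker_id>.<name>` naming scheme here are illustrative, not the actual SinkableHdrStatistic API):

```cpp
#include <string>
#include <utility>

// Illustrative sketch: a statistic that remembers which worker produced it,
// so a sink can emit a per-worker name such as
// "2.benchmark_http_client.latency_2xx".
class TaggedStatisticSketch {
public:
  TaggedStatisticSketch(std::string name, int worker_id)
      : name_(std::move(name)), worker_id_(worker_id) {}

  // The name as a sink would see it, prefixed with the worker id.
  std::string sinkName() const {
    return std::to_string(worker_id_) + "." + name_;
  }

private:
  std::string name_;
  int worker_id_;
};
```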

\cc @oschaaf

Signed-off-by: Qin Qin <qqin@google.com>
@qqustc
Contributor Author

qqustc commented Jul 22, 2020

Thanks @dubious90 for the review! The PR is ready for review again.

Contributor

@dubious90 dubious90 left a comment


LGTM, @mum4k to approve

In Envoy, the stat
[Store](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/store.h#L29)
is a singleton and provides a simple interface by which the rest of the code can
obtain handles to scopes, counters, gauges, and histograms. Envoy counters and
Contributor

Sorry I was unclear here. I was actually referring to scopes. I realize that scopes are probably not metrics themselves, but I'm not sure what they are in this context, and unlike store, we don't have a link to envoy documentation.

Signed-off-by: Qin Qin <qqin@google.com>
@qqustc
Contributor Author

qqustc commented Jul 23, 2020

/retest

@repokitteh-read-only

🔨 rebuilding ci/circleci: test_gcc (failed build)

🐱

Caused by: a #381 (comment) was created by @qqustc.


@mum4k mum4k merged commit 58c5fb6 into envoyproxy:master Jul 23, 2020
@qqustc qqustc deleted the pr3 branch July 23, 2020 15:25