Add stats to record latency metrics for each response code in NH#381
Add stats to record latency metrics for each response code in NH#381mum4k merged 49 commits intoenvoyproxy:masterfrom
Conversation
…metrics in NH (#1) * Create statistics.md * Update benchmark_client_impl.cc * Update benchmark_client_impl.h * Update stream_decoder.cc * Update stream_decoder.h * Update BUILD * Update benchmark_http_client_test.cc * Update test_integration_basics.py * Update stream_decoder_test.cc * local commit 1 Signed-off-by: qqustc <qqin@google.com> Commit from master Signed-off-by: qqustc <qqustc@gmail.com>
Signed-off-by: qqustc <qqin@google.com> Signed-off-by: qqustc <qqustc@gmail.com>
Signed-off-by: qqustc <qqin@google.com>
|
Could the abrupt socket termination complaints in the longer running ci tasks be associated to a subprocess being killed because of the closing/re-opening of the PR here? /retest |
I don't think so. When I closed and reopened the PR, there was no CI tasks running. The CI tasks only started to run after I made the last commit. |
|
Well apparently /retest doesn’t work for some reason, I’ll respin the jobs manually then. |
|
(In instances where they flake as observed in other cases, they have been running much longer) |
|
CircleCI doesn’t allow me to restart the jobs in its ui, those options are grayed out. |
oschaaf
left a comment
There was a problem hiding this comment.
Thanks, posting a round of comments and requesting feedback on an idea in there.
| COUNTER(pool_overflow) \ | ||
| COUNTER(pool_connection_failure) | ||
| COUNTER(pool_connection_failure) \ | ||
| COUNTER(total_req_sent) \ |
There was a problem hiding this comment.
How is this different from upstream_rq_total tracked by Envoy?
There was a problem hiding this comment.
it seams upstream_rq_total is per cluster (I'm not sure what cluster means in Nighthawk) and here total_req_sent is per worker.
If there is no difference between those 2 counters to the user, we should reuse upstream_rq_total.
There was a problem hiding this comment.
We set up a single cluster per worker today, so I think effectively the counters are tracking the same thing.
There was a problem hiding this comment.
Thanks Otto for the explanation! Decided not to add total_req_sent since it is equivalent to upstream_rq_total .
| if (response_code > 199 && response_code <= 299) { | ||
| benchmark_client_stats_.latency_on_success_req_.recordValue(latency_us); | ||
| } else { | ||
| benchmark_client_stats_.latency_on_error_req_.recordValue(latency_us); |
There was a problem hiding this comment.
I wonder if this is a one size fits all. This might often suffice, but could there be a case where one would like to control the status code aggregation more? E.g. bucketize 4xx and 5xx? One might be a missing file, handled by the webserver itself, whereas the other could be an application crashing out, both would have their own latency characteristics.
There was a problem hiding this comment.
Agree that we should add separate latency metrics for each code if it provides more value.
Is it better to have all (latency_1xx, latency_2xx, latency_3xx, latency_4xx, latency_5xx, latency_xxx) for uniformity rather than just adding (latency_4xx, latency_5xx)?
The previous reasoning was to wait after Envoy support metrics with key-value label to add a single latency with different codes as label and currently only add 2 to save memory (if that is a concern).
There was a problem hiding this comment.
That would be consistent with the counters we track too (and actually make them sort of redundant?).
Obviously out of scope here, but still, I'm kind of wondering if we could/should somehow attempt to generalise the concept of grouping metrics based on user-provided configuration. If we could do that today, the initial grouping proposed here would be a nice conservative default. So, thinking about it, maybe it is providing the means to change the default that I was concerned about here, and not so much the specific default. Maybe we shouldn't be discussing that here in this PR, but elsewhere :-) I'll defer to you for the specific default we choose here.
There was a problem hiding this comment.
Thanks Otto. Added latency statistics latency_1xx, latency_2xx, latency_3xx, latency_4xx, latency_5xx, latency_xxx.
| COUNTER(pool_connection_failure) | ||
| COUNTER(pool_connection_failure) \ | ||
| COUNTER(total_req_sent) \ | ||
| HISTOGRAM(latency_on_success_req, Microseconds) \ |
There was a problem hiding this comment.
Maybe include http_status in the name, as that's what is being used to define success/error?
| time_source_.monotonicTime() - request_start_) | ||
| .count(); | ||
| decoder_completion_callback_.exportLatency(stream_info_.response_code_.value(), latency_us); | ||
| } |
There was a problem hiding this comment.
Would a log line add value here in an else clause? That might need to guard against high frequency logging, but there might be something special going on when we'd hit that?
There was a problem hiding this comment.
Good point. Will add a sparse log.
mum4k
left a comment
There was a problem hiding this comment.
Partial review (documentation only for now).
docs/root/statistics.md
Outdated
| -----| ----- | ---------------- | ||
| total_req_sent | Counter | Total number of requests sent from Nighthawk | ||
| http_xxx | Counter | Total number of response with code xxx | ||
| stream_resets | Counter | Total number of sream reset |
docs/root/statistics.md
Outdated
| Name | Type | Description | ||
| -----| ----- | ---------------- | ||
| total_req_sent | Counter | Total number of requests sent from Nighthawk | ||
| http_xxx | Counter | Total number of response with code xxx |
There was a problem hiding this comment.
Would it be more useful to list the counters explicitly? How many are we planning to export, is it too many?
If we do list them explicitly, users can use this as a reference when looking for the complete list of metrics.
There was a problem hiding this comment.
Updated to list the counters explicitly. The plan is to convert all counters with different codes into a single streamz in customized Sinks (similar to what is in Monarch Sink).
docs/root/statistics.md
Outdated
| # Nighthawk Statistics | ||
|
|
||
| ## Background | ||
| Currently Nighthawk only outputs metrics at the end of a test run. There are no metrics streamed during a test run. In order to collect all the useful Nighthawk metrics, we plan to instrument Nighthawk with the functionality to stream its metrics. |
There was a problem hiding this comment.
Not sure if the future tense is a good choice here. We will end up implementing this at some point and this document will still talk about it as something we only plan to work on.
How would you feel about writing this document as a documentation for metrics that are already supported (as in already done) and if you want we could add a TODO or a note that this is work in progress. We can remove that note once done.
There was a problem hiding this comment.
Thanks Jakub for the suggestion! Agreed that it is better to write this document as a documentation for metrics that are already supported.
Updated the doc to document metrics that are already supported and added a note saying The work to stream its metrics is in progress.
docs/root/statistics.md
Outdated
| In Envoy, the stat [Store](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/store.h#L29) is a singleton and provides a simple interface by which the rest of the code can obtain handles to scopes, counters, gauges, and histograms. Envoy counters and gauges are periodically (configured at ~5 sec interval) flushed to the sinks. Note that currently histogram values are sent directly to the sinks. A stat [Sink](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/sink.h#L48) is an interface that takes generic stat data and translates it into a backend-specific wire format. Currently Envoy supports the TCP and UDP [statsd](https://github.com/b/statsd_spec) protocol (implemented in [statsd.h](https://github.com/envoyproxy/envoy/blob/master/source/extensions/stat_sinks/common/statsd/statsd.h)). Users can create their own Sink subclass to translate Envoy metrics into backend-specific format. | ||
|
|
||
| Envoy metrics can be defined using a macro, e.g. | ||
| ``` |
There was a problem hiding this comment.
Do we want to add syntax highlighting here, i.e. "cc" instead of just ""?
There was a problem hiding this comment.
Thanks @mum4k for the note. This is really useful to know. Done.
docs/root/statistics.md
Outdated
|
|
||
|
|
||
| ## Metrics Export in Nighthawk | ||
| Currently a single Nighthawk can run with multiple workers. In the future, Nighthawk will be extended to be able to run multiple instances together. Since each Nighthawk worker sends requests independently, we decide to export per worker level metrics since it provides several advantages over global level metrics (aggregated across all workers). Notice that *there are existing Nighthawk metrics already defined at per worker level* ([example](https://github.com/envoyproxy/nighthawk/blob/bc72a52efdc489beaa0844b34f136e03394bd355/source/client/benchmark_client_impl.cc#L61)). |
There was a problem hiding this comment.
possible typo: "we decided".
| @@ -0,0 +1,63 @@ | |||
| # Nighthawk Statistics | |||
There was a problem hiding this comment.
Can we format this document so that each line is at most 80 characters?
Signed-off-by: qqustc <qqin@google.com>
…Statistic Signed-off-by: Qin Qin <qqin@google.com> Signed-off-by: qqustc <qqustc@gmail.com> Signed-off-by: Qin Qin <qqin@google.com>
Signed-off-by: Qin Qin <qqin@google.com>
Signed-off-by: Qin Qin <qqin@google.com>
mum4k
left a comment
There was a problem hiding this comment.
I note this PR is adding a Submodule called "nighthawk". Is that intentional and can you help me understand what it is for?
| }; | ||
|
|
||
| // For histogram metrics, Nighthawk Statistic is used instead of Envoy Histogram. | ||
| struct BenchmarkClientStatistic { |
There was a problem hiding this comment.
What is the relationship between struct BenchmarkClientStats and struct BenchmarkClientStatistic? They have very similar names, yet don't look directly related.
Can we improve the readability by either or both:
- choosing better name for the new struct, one that would clearly distinguish it from the existing one.
- adding a comment here and above
struct BenchmarkClientStatsexplaining the difference.
There was a problem hiding this comment.
Added a comment explaining the difference.
For counter metrics, Nighthawk use Envoy Counter directly. For histogram metrics, Nighthawk use its own Statistic instead of Envoy Histogram. Here BenchmarkClientStats contains only counters while BenchmarkClientStatistic contains only histograms.
There was a problem hiding this comment.
I think the comment helps here, but not elsewhere in the code where this exists. Could we rename these BenchmarkClientCounters and BenchmarkClientHistograms? Or something else that distinguishes their purposes in the names themselves.
There was a problem hiding this comment.
Thanks Nate. Renamed BenchmarkClientStats to BenchmarkClientCounters to avoid confusion (while keep BenchmarkClientStatistic unchanged since some of its members are StreamingStatistic which is not exactly histograms)
There was a problem hiding this comment.
Makes sense to me. Thanks for the effort here!
| StatisticPtr response_statistic_; | ||
| StatisticPtr response_header_size_statistic_; | ||
| StatisticPtr response_body_size_statistic_; | ||
| StatisticPtr latency_1xx_statistic_; |
There was a problem hiding this comment.
Why do we list these here individually? Can we reuse the struct BenchmarkClientStatistic defined above?
There was a problem hiding this comment.
Yes, changed to reuse BenchmarkClientStatistic struct here.
source/client/factories_impl.cc
Outdated
| Envoy::Upstream::ClusterManagerPtr& cluster_manager, | ||
| Envoy::Tracing::HttpTracerSharedPtr& http_tracer, absl::string_view cluster_name, | ||
| RequestSource& request_generator) const { | ||
| const int sink_stat_prefix, RequestSource& request_generator) const { |
There was a problem hiding this comment.
We probably don't need to mark an int argument const, since int being a basic type is immutable (passed by value).
source/client/factories_impl.cc
Outdated
| // NullStatistic). | ||
| // TODO(#292): Create options and have the StatisticFactory consider those when instantiating | ||
| // statistics. | ||
| BenchmarkClientStatistic statistic; |
There was a problem hiding this comment.
Would it make sense to define a constructor for this struct right next to it and just call the constructor here?
There was a problem hiding this comment.
Thanks for the suggestion. Added a constructor for the struct. Yes, this makes the code simpler.
source/client/factories_impl.h
Outdated
| Envoy::Upstream::ClusterManagerPtr& cluster_manager, | ||
| Envoy::Tracing::HttpTracerSharedPtr& http_tracer, | ||
| absl::string_view cluster_name, | ||
| absl::string_view cluster_name, const int sink_stat_prefix, |
There was a problem hiding this comment.
Comment about unnecessary const qualifier next to an int argument applies here as well.
test/benchmark_http_client_test.cc
Outdated
| {{":scheme", "http"}, {":method", "GET"}, {":path", "/"}, {":host", "localhost"}})); | ||
| return std::make_unique<RequestImpl>(header); | ||
| }; | ||
| statistic_.connect_statistic = std::make_unique<StreamingStatistic>(); |
There was a problem hiding this comment.
Comment about having a constructor next to the struct itself applies here too. Assuming that the differences between the two call sites can be parameterized.
| EXPECT_EQ(10, client_->statistics()["benchmark_http_client.latency_2xx"]->count()); | ||
| } | ||
|
|
||
| TEST_F(BenchmarkClientHttpTest, ExportSuccessLatency) { |
There was a problem hiding this comment.
I note the test coverage for the latency metrics.
How are we testing population and propagation of the other fields on struct BenchmarkClientStatistic?
There was a problem hiding this comment.
The tests on connect_statistic and response_statistic are already in test case EnableLatencyMeasurement. Added test for response_header_size_statistic and response_body_size_statistic there too.
Signed-off-by: Qin Qin <qqin@google.com>
Signed-off-by: Qin Qin <qqin@google.com>
Thanks Jakub. This was not intentional and it has been deleted. |
|
Thanks for implementing the changes @qqustc. @dubious90 please review and assign back to me once done. |
oschaaf
left a comment
There was a problem hiding this comment.
It looks like we're now always instantiating SinkableHdrStatistic instances from the factory, but there's no sink config involved. I'm curious: how will that behave? Do we need to consider adding options in this PR to instantiate SinkableXXXStatistic or just XXXStatistic before merging this? E.g. when an option like --sink <address:port> is given?
Thanks Otto. When there is no sink configured, |
Great, thanks for the explanation. |
docs/root/statistics.md
Outdated
| benchmark_http_client.latency_5xx | HdrStatistic | Latency (in Nanosecond) histogram of request with code 5xx | ||
| benchmark_http_client.latency_xxx | HdrStatistic | Latency (in Nanosecond) histogram of request with code <100 or >=600 | ||
| benchmark_http_client.queue_to_connect | HdrStatistic | Histogram of request connection time (in Nanosecond) | ||
| benchmark_http_client.request_to_response | HdrStatistic | Latency (in Nanosecond) histogram of all requests (include requests with stream reset or pool failure) |
There was a problem hiding this comment.
"histogram of all requests" feels like the wrong clarification here. All of the histographs deal with all of the requests, right? Would this be better as "histogram of full request time"?
There was a problem hiding this comment.
All other latency_*** histograms only collect values for requests (where response with a code successfully returned), while the request_to_response histogram also include requests with stream reset or pool failure.
Updated to Latency (in Nanosecond) histogram include requests with stream reset or pool failure
There was a problem hiding this comment.
Understood. Thanks for the clarification.
| benchmark_http_client.latency_xxx | HdrStatistic | Latency (in Nanosecond) histogram of request with code <100 or >=600 | ||
| benchmark_http_client.queue_to_connect | HdrStatistic | Histogram of request connection time (in Nanosecond) | ||
| benchmark_http_client.request_to_response | HdrStatistic | Latency (in Nanosecond) histogram of all requests (include requests with stream reset or pool failure) | ||
| benchmark_http_client.response_header_size | StreamingStatistic | Statistic of response header size (min, max, mean, pstdev values in bytes) |
There was a problem hiding this comment.
It's not clear to me reading this what these statistics are going to look like. If HdrStatistic is a histogram, what is StreamingStatistic? If it's also a histogram, how does it differ from HdrStatistic?
There was a problem hiding this comment.
Both StreamingStatistic and HdrStatistic are implementations of NH Statistic. The difference is HdrStatistic is a "real" histogram (it stores the distributions and provide detailed percentile values) while StreamingStatistic is not exactly a histogram (it only provide min, max, mean, pstdev values and does not have percentile values)
As shown in this table, different NH metrics choose different implementations of NH Statistic.
docs/root/statistics.md
Outdated
| In Envoy, the stat | ||
| [Store](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/store.h#L29) | ||
| is a singleton and provides a simple interface by which the rest of the code can | ||
| obtain handles to scopes, counters, gauges, and histograms. Envoy counters and |
There was a problem hiding this comment.
it sticks out that this is undefined, while counters, gauges, and histograms are defined above.
There was a problem hiding this comment.
Store is not a type of metrics so I decide not to list its definition above with other metrics. Since it is a (common) Envoy concept and its definition is not closely relevant in the NH Statistics doc, so only a link to the code definition is provided here.
There was a problem hiding this comment.
Sorry I was unclear here. I was actually referring to scopes. I realize that scopes are probably not metrics themselves, but I'm not sure what they are in this context, and unlike store, we don't have a link to envoy documentation.
There was a problem hiding this comment.
Added a code link to scopes too.
| Users can create their own Sink subclass to translate Envoy metrics into | ||
| backend-specific format. | ||
|
|
||
| Envoy metrics can be defined using a macro, e.g. |
There was a problem hiding this comment.
This example is helpful, but also opens up other questions. We're defining metrics here, but they're just names. How are they calculated? Can you add a second code snippet below it, explaining how to collect stats?
docs/root/statistics.md
Outdated
| Currently Envoy metrics don't support key-value map. As a result, for metrics to | ||
| be broken down by certain dimensions, we need to define a separate metric for | ||
| each dimension. For example, currently Nighthawk defines separate | ||
| [counters](https://github.com/envoyproxy/nighthawk/blob/master/source/client/benchmark_client_impl.h#L35-L40) |
There was a problem hiding this comment.
nit: I think your intention would be clearer if you changed what you're bracketing:
currently Nighthawk defines [separate counters]
| namespace Nighthawk { | ||
| namespace Client { | ||
|
|
||
| BenchmarkClientStatistic::BenchmarkClientStatistic(BenchmarkClientStatistic& statistic) |
There was a problem hiding this comment.
I'm probably misunderstanding, but this appears to be a copy constructor that moves all of the underlying fields, which seems unideal.
There was a problem hiding this comment.
Thanks Nate for notice this. Changed to move constructor.
|
|
||
| void BenchmarkClientHttpImpl::exportLatency(const uint32_t response_code, | ||
| const uint64_t latency_ns) { | ||
| if (response_code > 99 && response_code <= 199) { |
There was a problem hiding this comment.
optional: it seems very odd to me to notate this expression this way. It's more conventional to do:
response_code >= 100 && response_code < 200
There was a problem hiding this comment.
ah, I was following the code here.
nighthawk/source/client/benchmark_client_impl.cc
Lines 140 to 152 in c57811f
| COUNTER(pool_overflow) \ | ||
| COUNTER(pool_connection_failure) | ||
|
|
||
| // For counter metrics, Nighthawk use Envoy Counter directly. For histogram metrics, Nighthawk use |
There was a problem hiding this comment.
nit: grammar - Nighthawk uses
| }; | ||
|
|
||
| // For histogram metrics, Nighthawk Statistic is used instead of Envoy Histogram. | ||
| struct BenchmarkClientStatistic { |
There was a problem hiding this comment.
I think the comment helps here, but not elsewhere in the code where this exists. Could we rename these BenchmarkClientCounters and BenchmarkClientHistograms? Or something else that distinguishes their purposes in the names themselves.
source/client/factories_impl.cc
Outdated
| Envoy::Upstream::ClusterManagerPtr& cluster_manager, | ||
| Envoy::Tracing::HttpTracerSharedPtr& http_tracer, absl::string_view cluster_name, | ||
| RequestSource& request_generator) const { | ||
| int sink_stat_prefix, RequestSource& request_generator) const { |
There was a problem hiding this comment.
it's possible this is also true elsewhere in this PR, but is it desirable for us to only allow an int as the sink_stat_prefix? In the future, isn't it possible that we'd want to use a prefix that isn't just the worker number?
If not, then this really isn't just a prefix. It has the specific meaning of being a worker number.
There was a problem hiding this comment.
Discussed offline. Decide to change sink_stat_prefix to worker_id to make it explicitly that this will be used to populate worker_id field in SinkableHdrStatistic
nighthawk/source/client/factories_impl.cc
Lines 45 to 50 in d8f4d31
\cc @oschaaf
Signed-off-by: Qin Qin <qqin@google.com>
|
Thanks @dubious90 for the review! The PR is ready for review again. |
dubious90
left a comment
There was a problem hiding this comment.
LGTM, mumak to approve
docs/root/statistics.md
Outdated
| In Envoy, the stat | ||
| [Store](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/store.h#L29) | ||
| is a singleton and provides a simple interface by which the rest of the code can | ||
| obtain handles to scopes, counters, gauges, and histograms. Envoy counters and |
There was a problem hiding this comment.
Sorry I was unclear here. I was actually referring to scopes. I realize that scopes are probably not metrics themselves, but I'm not sure what they are in this context, and unlike store, we don't have a link to envoy documentation.
|
/retest |
|
🔨 rebuilding |
Add the following latency histogram in benchmark_client_impl :
Linked Issues: #344
Testing: unit tests
Docs Changes: Add docs/root/statistics.md
Signed-off-by: Qin Qin qqin@google.com