diff --git a/docs/root/statistics.md b/docs/root/statistics.md new file mode 100644 index 000000000..a8926f672 --- /dev/null +++ b/docs/root/statistics.md @@ -0,0 +1,63 @@ +# Nighthawk Statistics + +## Background +Currently Nighthawk only outputs metrics at the end of a test run; no metrics are streamed while a test is running. In order to collect all the useful Nighthawk metrics, we plan to instrument Nighthawk with the functionality to stream its metrics. + + +## Statistics in BenchmarkClient +Per-worker statistics are defined in [benchmark_client_impl.h](https://github.com/envoyproxy/nighthawk/blob/master/source/client/benchmark_client_impl.h) Name | Type | Description -----| ----- | ---------------- total_req_sent | Counter | Total number of requests sent from Nighthawk http_xxx | Counter | Total number of responses with code xxx stream_resets | Counter | Total number of stream resets pool_overflow | Counter | Total number of times the connection pool overflowed pool_connection_failure | Counter | Total number of times a pool connection failed latency_on_success_req | Histogram | Latency (in microseconds) histogram of successful requests with 2xx codes latency_on_error_req | Histogram | Latency (in microseconds) histogram of failed requests with codes other than 2xx + +## Envoy Metrics Model +Since Nighthawk is built on top of [Envoy](https://github.com/envoyproxy/envoy) and similar metrics are already exported from Envoy, it is natural to follow Envoy's example. Furthermore, *Envoy typed metrics are already being used in Nighthawk* ([example](https://github.com/envoyproxy/nighthawk/blob/master/source/client/benchmark_client_impl.h#L33-L46)). + + +Envoy has 3 types of metrics: +- Counters: Unsigned integers that can only increase, representing how many times an event happened, e.g. total number of requests. +- Gauges: Unsigned integers that can increase or decrease, e.g. number of active connections.
+- Histograms: Unsigned integers that yield summarized percentile values, e.g. latency distributions. + + +In Envoy, the stat [Store](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/store.h#L29) is a singleton and provides a simple interface by which the rest of the code can obtain handles to scopes, counters, gauges, and histograms. Envoy counters and gauges are periodically flushed to the sinks (at a configurable interval, ~5 sec by default). Note that currently histogram values are sent directly to the sinks. A stat [Sink](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/sink.h#L48) is an interface that takes generic stat data and translates it into a backend-specific wire format. Currently Envoy supports the TCP and UDP [statsd](https://github.com/b/statsd_spec) protocol (implemented in [statsd.h](https://github.com/envoyproxy/envoy/blob/master/source/extensions/stat_sinks/common/statsd/statsd.h)). Users can create their own Sink subclass to translate Envoy metrics into a backend-specific format. + +Envoy metrics can be defined using a macro, e.g. +``` +#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM) \ + COUNTER(upstream_cx_total) \ + GAUGE(upstream_cx_active, NeverImport) \ + HISTOGRAM(upstream_cx_length, Milliseconds) + +struct ClusterStats { + ALL_CLUSTER_STATS(GENERATE_COUNTER_STRUCT, GENERATE_GAUGE_STRUCT, GENERATE_HISTOGRAM_STRUCT) +}; +``` + +## Envoy Metrics Limitation +Currently Envoy metrics don't support key-value maps. As a result, for a metric to be broken down by a dimension, we need to define a separate metric for each value of that dimension. For example, Nighthawk currently defines separate [counters](https://github.com/envoyproxy/nighthawk/blob/master/source/client/benchmark_client_impl.h#L35-L40) to monitor the number of responses in each response code class.
+ +Due to this limitation, for the latency metric we define 2 metrics: *latency_on_success_req* (to capture the latency of successful requests with 2xx codes) and *latency_on_error_req* (to capture the latency of requests with codes other than 2xx). + + +## Envoy Metrics Flush +Envoy uses a flush timer to periodically flush metrics into stat sinks ([here](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/source/server/server.cc#L479-L480)) at a configured interval (defaulting to 5 sec). For every metric flush, Envoy calls the function [flushMetricsToSinks()](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/source/server/server.cc#L175) to create a metric snapshot from the Envoy stat store and flush the snapshot to all sinks through `sink->flush(snapshot)`. + + +## Metrics Export in Nighthawk +Currently a single Nighthawk instance can run with multiple workers. In the future, Nighthawk will be extended to run multiple instances together. Since each Nighthawk worker sends requests independently, we decided to export per worker level metrics, which provide several advantages over global level metrics (aggregated across all workers). Notice that *there are existing Nighthawk metrics already defined at the per worker level* ([example](https://github.com/envoyproxy/nighthawk/blob/bc72a52efdc489beaa0844b34f136e03394bd355/source/client/benchmark_client_impl.cc#L61)). +- Per worker level metrics provide information about the performance of each worker that would be hidden by global level metrics. +- They keep the workers independent, which makes it easier and more efficient to scale up to multiple Nighthawks with large numbers of workers. + +Metrics can be defined at the per worker level using [Scope](https://github.com/envoyproxy/envoy/blob/e9c2c8c4a0141c9634316e8283f98f412d0dd207/include/envoy/stats/scope.h#L35) (e.g. `cluster..total_request_sent`). The dynamic portions of a metric (e.g. 
`worker_id`) can be embedded into the metric name. A [TagSpecifier](https://github.com/envoyproxy/envoy/blob/7a652daf35d7d4a6a6bad5a010fe65947ee4411a/api/envoy/config/metrics/v3/stats.proto#L182) can be specified in the bootstrap configuration, which transforms the dynamic portions into tags. When per worker level metrics are exported from Nighthawk, multiple per worker metrics can be converted into a single metric with a `worker_id` label in the stat Sink, provided the corresponding backend metric supports key-value maps. + +## Reference +- [Nighthawk: architecture and key concepts](https://github.com/envoyproxy/nighthawk/blob/master/docs/root/overview.md) +- [Envoy Stats System](https://github.com/envoyproxy/envoy/blob/master/source/docs/stats.md) +- [Envoy Stats blog](https://blog.envoyproxy.io/envoy-stats-b65c7f363342) diff --git a/source/client/benchmark_client_impl.cc b/source/client/benchmark_client_impl.cc index 37902e58f..3f5fbaa5a 100644 --- a/source/client/benchmark_client_impl.cc +++ b/source/client/benchmark_client_impl.cc @@ -58,7 +58,8 @@ BenchmarkClientHttpImpl::BenchmarkClientHttpImpl( response_statistic_(std::move(response_statistic)), response_header_size_statistic_(std::move(response_header_size_statistic)), response_body_size_statistic_(std::move(response_body_size_statistic)), use_h2_(use_h2), - benchmark_client_stats_({ALL_BENCHMARK_CLIENT_STATS(POOL_COUNTER(*scope_))}), + benchmark_client_stats_( + {ALL_BENCHMARK_CLIENT_STATS(POOL_COUNTER(*scope_), POOL_HISTOGRAM(*scope_))}), cluster_manager_(cluster_manager), http_tracer_(http_tracer), cluster_name_(std::string(cluster_name)), request_generator_(std::move(request_generator)), provide_resource_backpressure_(provide_resource_backpressure) { @@ -123,6 +124,7 @@ bool BenchmarkClientHttpImpl::tryStartRequest(CompletionCallback caller_completi *response_body_size_statistic_, request->header(), shouldMeasureLatencies(), content_length, generator_, http_tracer_); requests_initiated_++; + 
benchmark_client_stats_.total_req_sent_.inc(); pool_ptr->newStream(*stream_decoder, *stream_decoder); return true; } @@ -168,5 +170,14 @@ void BenchmarkClientHttpImpl::onPoolFailure(Envoy::Http::ConnectionPool::PoolFai } } +void BenchmarkClientHttpImpl::exportLatency(const uint32_t response_code, + const uint64_t latency_us) { + if (response_code > 199 && response_code <= 299) { + benchmark_client_stats_.latency_on_success_req_.recordValue(latency_us); + } else { + benchmark_client_stats_.latency_on_error_req_.recordValue(latency_us); + } +} + } // namespace Client -} // namespace Nighthawk \ No newline at end of file +} // namespace Nighthawk diff --git a/source/client/benchmark_client_impl.h b/source/client/benchmark_client_impl.h index a40e90a19..2903c935c 100644 --- a/source/client/benchmark_client_impl.h +++ b/source/client/benchmark_client_impl.h @@ -30,7 +30,7 @@ using namespace std::chrono_literals; using namespace Envoy; // We need this because of macro expectations. -#define ALL_BENCHMARK_CLIENT_STATS(COUNTER) \ +#define ALL_BENCHMARK_CLIENT_STATS(COUNTER, HISTOGRAM) \ COUNTER(stream_resets) \ COUNTER(http_1xx) \ COUNTER(http_2xx) \ @@ -39,10 +39,13 @@ using namespace Envoy; // We need this because of macro expectations. 
COUNTER(http_5xx) \ COUNTER(http_xxx) \ COUNTER(pool_overflow) \ - COUNTER(pool_connection_failure) + COUNTER(pool_connection_failure) \ + COUNTER(total_req_sent) \ + HISTOGRAM(latency_on_success_req, Microseconds) \ + HISTOGRAM(latency_on_error_req, Microseconds) struct BenchmarkClientStats { - ALL_BENCHMARK_CLIENT_STATS(GENERATE_COUNTER_STRUCT) + ALL_BENCHMARK_CLIENT_STATS(GENERATE_COUNTER_STRUCT, GENERATE_HISTOGRAM_STRUCT) }; class Http1PoolImpl : public Envoy::Http::Http1::ProdConnPoolImpl { @@ -104,6 +107,7 @@ class BenchmarkClientHttpImpl : public BenchmarkClient, // StreamDecoderCompletionCallback void onComplete(bool success, const Envoy::Http::ResponseHeaderMap& headers) override; void onPoolFailure(Envoy::Http::ConnectionPool::PoolFailureReason reason) override; + void exportLatency(const uint32_t response_code, const uint64_t latency_us) override; // Helpers Envoy::Http::ConnectionPool::Instance* pool() { @@ -142,4 +146,4 @@ class BenchmarkClientHttpImpl : public BenchmarkClient, }; } // namespace Client -} // namespace Nighthawk \ No newline at end of file +} // namespace Nighthawk diff --git a/source/client/stream_decoder.cc b/source/client/stream_decoder.cc index 96c6295d5..af25d6aaf 100644 --- a/source/client/stream_decoder.cc +++ b/source/client/stream_decoder.cc @@ -49,6 +49,13 @@ void StreamDecoder::onComplete(bool success) { ASSERT(!success || complete_); if (success && measure_latencies_) { latency_statistic_.addValue((time_source_.monotonicTime() - request_start_).count()); + // At this point StreamDecoder::decodeHeaders() should have been called. 
+ if (stream_info_.response_code_.has_value()) { + const uint64_t latency_us = std::chrono::duration_cast( + time_source_.monotonicTime() - request_start_) + .count(); + decoder_completion_callback_.exportLatency(stream_info_.response_code_.value(), latency_us); + } } upstream_timing_.onLastUpstreamRxByteReceived(time_source_); response_body_sizes_statistic_.addValue(stream_info_.bytesSent()); diff --git a/source/client/stream_decoder.h b/source/client/stream_decoder.h index 869a8d57e..6edb1b0e0 100644 --- a/source/client/stream_decoder.h +++ b/source/client/stream_decoder.h @@ -24,6 +24,7 @@ class StreamDecoderCompletionCallback { virtual ~StreamDecoderCompletionCallback() = default; virtual void onComplete(bool success, const Envoy::Http::ResponseHeaderMap& headers) PURE; virtual void onPoolFailure(Envoy::Http::ConnectionPool::PoolFailureReason reason) PURE; + virtual void exportLatency(const uint32_t response_code, const uint64_t latency_us) PURE; }; // TODO(oschaaf): create a StreamDecoderPool? 
diff --git a/test/BUILD b/test/BUILD index 868394027..c6d91d293 100644 --- a/test/BUILD +++ b/test/BUILD @@ -18,9 +18,9 @@ envoy_cc_test( "//test/test_common:environment_lib", "@envoy//source/common/http:header_map_lib_with_external_headers", "@envoy//source/common/network:utility_lib_with_external_headers", - "@envoy//source/common/stats:isolated_store_lib_with_external_headers", "@envoy//source/exe:process_wide_lib_with_external_headers", "@envoy//test/mocks/runtime:runtime_mocks", + "@envoy//test/mocks/stats:stats_mocks", "@envoy//test/mocks/stream_info:stream_info_mocks", "@envoy//test/mocks/thread_local:thread_local_mocks", "@envoy//test/mocks/upstream:upstream_mocks", diff --git a/test/benchmark_http_client_test.cc b/test/benchmark_http_client_test.cc index 0539f487b..f535eed2a 100644 --- a/test/benchmark_http_client_test.cc +++ b/test/benchmark_http_client_test.cc @@ -3,10 +3,10 @@ #include "external/envoy/source/common/http/header_map_impl.h" #include "external/envoy/source/common/network/utility.h" #include "external/envoy/source/common/runtime/runtime_impl.h" -#include "external/envoy/source/common/stats/isolated_store_impl.h" #include "external/envoy/source/exe/process_wide.h" #include "external/envoy/test/mocks/common.h" #include "external/envoy/test/mocks/runtime/mocks.h" +#include "external/envoy/test/mocks/stats/mocks.h" #include "external/envoy/test/mocks/stream_info/mocks.h" #include "external/envoy/test/mocks/thread_local/mocks.h" #include "external/envoy/test/mocks/upstream/mocks.h" @@ -124,7 +124,7 @@ class BenchmarkClientHttpTest : public Test { void setupBenchmarkClient() { client_ = std::make_unique( - *api_, *dispatcher_, store_, std::make_unique(), + *api_, *dispatcher_, mock_store_, std::make_unique(), std::make_unique(), std::make_unique(), std::make_unique(), false, cluster_manager_, http_tracer_, "benchmark", request_generator_, true); @@ -143,7 +143,8 @@ class BenchmarkClientHttpTest : public Test { } Envoy::Event::TestRealTimeSystem 
time_system_; - Envoy::Stats::IsolatedStoreImpl store_; + // deliverHistogramToSinks() is not implemented in IsolatedStoreImpl so test with a mock store. + Envoy::Stats::MockIsolatedStatsStore mock_store_; Envoy::Api::ApiPtr api_; Envoy::Event::DispatcherPtr dispatcher_; Envoy::Runtime::RandomGeneratorImpl generator_; @@ -166,45 +167,76 @@ TEST_F(BenchmarkClientHttpTest, BasicTestH1200) { response_code_ = "200"; testBasicFunctionality(2, 3, 10); EXPECT_EQ(5, getCounter("http_2xx")); + EXPECT_EQ(5, getCounter("total_req_sent")); } TEST_F(BenchmarkClientHttpTest, BasicTestH1300) { response_code_ = "300"; testBasicFunctionality(0, 11, 10); EXPECT_EQ(10, getCounter("http_3xx")); + EXPECT_EQ(10, getCounter("total_req_sent")); } TEST_F(BenchmarkClientHttpTest, BasicTestH1404) { response_code_ = "404"; testBasicFunctionality(0, 1, 10); EXPECT_EQ(1, getCounter("http_4xx")); + EXPECT_EQ(1, getCounter("total_req_sent")); } TEST_F(BenchmarkClientHttpTest, WeirdStatus) { response_code_ = "601"; testBasicFunctionality(0, 1, 10); EXPECT_EQ(1, getCounter("http_xxx")); + EXPECT_EQ(1, getCounter("total_req_sent")); } TEST_F(BenchmarkClientHttpTest, EnableLatencyMeasurement) { setupBenchmarkClient(); EXPECT_EQ(false, client_->shouldMeasureLatencies()); + EXPECT_CALL(mock_store_, deliverHistogramToSinks(_, _)).Times(0); testBasicFunctionality(10, 1, 10); EXPECT_EQ(0, client_->statistics()["benchmark_http_client.queue_to_connect"]->count()); EXPECT_EQ(0, client_->statistics()["benchmark_http_client.request_to_response"]->count()); + EXPECT_EQ(10, getCounter("total_req_sent")); client_->setShouldMeasureLatencies(true); + EXPECT_CALL(mock_store_, deliverHistogramToSinks(_, _)).Times(10); testBasicFunctionality(10, 1, 10); EXPECT_EQ(10, client_->statistics()["benchmark_http_client.queue_to_connect"]->count()); EXPECT_EQ(10, client_->statistics()["benchmark_http_client.request_to_response"]->count()); + EXPECT_EQ(20, getCounter("total_req_sent")); +} + +TEST_F(BenchmarkClientHttpTest, 
ExportSuccessLatency) { + setupBenchmarkClient(); + uint64_t latency = 10; + EXPECT_CALL(mock_store_, deliverHistogramToSinks(Property(&Envoy::Stats::Metric::name, + "benchmark.latency_on_success_req"), + latency)) + .Times(1); + EXPECT_CALL(mock_store_, + deliverHistogramToSinks( + Property(&Envoy::Stats::Metric::name, "benchmark.latency_on_error_req"), latency)) + .Times(0); + client_->exportLatency(/*response_code=*/200, /*latency_us=*/latency); +} + +TEST_F(BenchmarkClientHttpTest, ExportErrorLatency) { + setupBenchmarkClient(); + uint64_t latency = 10; + EXPECT_CALL(mock_store_, deliverHistogramToSinks(Property(&Envoy::Stats::Metric::name, + "benchmark.latency_on_success_req"), + latency)) + .Times(0); + EXPECT_CALL(mock_store_, + deliverHistogramToSinks( + Property(&Envoy::Stats::Metric::name, "benchmark.latency_on_error_req"), latency)) + .Times(1); + client_->exportLatency(/*response_code=*/500, /*latency_us=*/latency); } TEST_F(BenchmarkClientHttpTest, StatusTrackingInOnComplete) { - auto store = std::make_unique(); - client_ = std::make_unique( - *api_, *dispatcher_, *store, std::make_unique(), - std::make_unique(), std::make_unique(), - std::make_unique(), false, cluster_manager_, http_tracer_, "foo", - request_generator_, true); + setupBenchmarkClient(); Envoy::Http::ResponseHeaderMapPtr header = Envoy::Http::ResponseHeaderMapImpl::create(); header->setStatus(1); diff --git a/test/integration/test_integration_basics.py b/test/integration/test_integration_basics.py index df02ef193..bcd57e093 100644 --- a/test/integration/test_integration_basics.py +++ b/test/integration/test_integration_basics.py @@ -34,6 +34,7 @@ def test_http_h1(http_test_server_fixture): assertCounterEqual(counters, "upstream_rq_pending_total", 1) assertCounterEqual(counters, "upstream_rq_total", 25) assertCounterEqual(counters, "default.total_match_count", 1) + assertCounterEqual(counters, "benchmark.total_req_sent", 25) global_histograms = 
http_test_server_fixture.getNighthawkGlobalHistogramsbyIdFromJson(parsed_json) assertEqual(int(global_histograms["benchmark_http_client.response_body_size"]["count"]), 25) @@ -47,7 +48,7 @@ def test_http_h1(http_test_server_fixture): assertEqual(int(global_histograms["benchmark_http_client.response_body_size"]["raw_pstdev"]), 0) assertEqual(int(global_histograms["benchmark_http_client.response_header_size"]["raw_pstdev"]), 0) - assertEqual(len(counters), 12) + assertEqual(len(counters), 13) def mini_stress_test(fixture, args): @@ -179,7 +180,8 @@ def test_http_h2(http_test_server_fixture): assertCounterEqual(counters, "upstream_rq_pending_total", 1) assertCounterEqual(counters, "upstream_rq_total", 25) assertCounterEqual(counters, "default.total_match_count", 1) - assertEqual(len(counters), 12) + assertCounterEqual(counters, "benchmark.total_req_sent", 25) + assertEqual(len(counters), 13) def test_http_concurrency(http_test_server_fixture): @@ -224,7 +226,8 @@ def test_https_h1(https_test_server_fixture): assertCounterEqual(counters, "ssl.sigalgs.rsa_pss_rsae_sha256", 1) assertCounterEqual(counters, "ssl.versions.TLSv1.2", 1) assertCounterEqual(counters, "default.total_match_count", 1) - assertEqual(len(counters), 17) + assertCounterEqual(counters, "benchmark.total_req_sent", 25) + assertEqual(len(counters), 18) server_stats = https_test_server_fixture.getTestServerStatisticsJson() assertEqual( @@ -258,7 +261,8 @@ def test_https_h2(https_test_server_fixture): assertCounterEqual(counters, "ssl.sigalgs.rsa_pss_rsae_sha256", 1) assertCounterEqual(counters, "ssl.versions.TLSv1.2", 1) assertCounterEqual(counters, "default.total_match_count", 1) - assertEqual(len(counters), 17) + assertCounterEqual(counters, "benchmark.total_req_sent", 25) + assertEqual(len(counters), 18) def test_https_h2_multiple_connections(https_test_server_fixture): diff --git a/test/stream_decoder_test.cc b/test/stream_decoder_test.cc index 93cb5333f..9327bbecd 100644 --- 
a/test/stream_decoder_test.cc +++ b/test/stream_decoder_test.cc @@ -37,6 +37,9 @@ class StreamDecoderTest : public Test, public StreamDecoderCompletionCallback { stream_decoder_completion_callbacks_++; } void onPoolFailure(Envoy::Http::ConnectionPool::PoolFailureReason) override { pool_failures_++; } + void exportLatency(const uint32_t, const uint64_t) override { + stream_decoder_export_latency_callbacks_++; + } Envoy::Event::TestRealTimeSystem time_system_; Envoy::Stats::IsolatedStoreImpl store_; @@ -49,6 +52,7 @@ class StreamDecoderTest : public Test, public StreamDecoderCompletionCallback { HeaderMapPtr request_headers_; uint64_t stream_decoder_completion_callbacks_{0}; uint64_t pool_failures_{0}; + uint64_t stream_decoder_export_latency_callbacks_{0}; Envoy::Runtime::RandomGeneratorImpl random_generator_; Envoy::Tracing::HttpTracerSharedPtr http_tracer_; Envoy::Http::ResponseHeaderMapPtr test_header_; @@ -64,6 +68,7 @@ TEST_F(StreamDecoderTest, HeaderOnlyTest) { decoder->decodeHeaders(std::move(test_header_), true); EXPECT_TRUE(is_complete); EXPECT_EQ(1, stream_decoder_completion_callbacks_); + EXPECT_EQ(0, stream_decoder_export_latency_callbacks_); } TEST_F(StreamDecoderTest, HeaderWithBodyTest) { @@ -112,6 +117,7 @@ TEST_F(StreamDecoderTest, LatencyIsNotMeasured) { decoder->decodeHeaders(std::move(test_header_), true); EXPECT_EQ(0, connect_statistic_.count()); EXPECT_EQ(0, latency_statistic_.count()); + EXPECT_EQ(0, stream_decoder_export_latency_callbacks_); } TEST_F(StreamDecoderTest, LatencyIsMeasured) { @@ -146,9 +152,11 @@ TEST_F(StreamDecoderTest, LatencyIsMeasured) { decoder->onPoolReady(stream_encoder, ptr, stream_info); EXPECT_EQ(1, connect_statistic_.count()); decoder->decodeHeaders(std::move(test_header_), false); + EXPECT_EQ(0, stream_decoder_export_latency_callbacks_); decoder->decodeTrailers(std::move(test_trailer_)); EXPECT_EQ(1, connect_statistic_.count()); EXPECT_EQ(1, latency_statistic_.count()); + EXPECT_EQ(1, 
stream_decoder_export_latency_callbacks_); } TEST_F(StreamDecoderTest, StreamResetTest) { @@ -161,6 +169,7 @@ TEST_F(StreamDecoderTest, StreamResetTest) { decoder->onResetStream(Envoy::Http::StreamResetReason::LocalReset, "fooreason"); EXPECT_TRUE(is_complete); // these do get reported. EXPECT_EQ(1, stream_decoder_completion_callbacks_); + EXPECT_EQ(0, stream_decoder_export_latency_callbacks_); } TEST_F(StreamDecoderTest, PoolFailureTest) {