envoyproxy · mum4k · Jul 23, 2020 · Jun 22, 2020 · Jun 23, 2020 · Jun 24, 2020
diff --git a/docs/root/statistics.md b/docs/root/statistics.md
@@ -0,0 +1,63 @@
+# Nighthawk Statistics
+
+## Background
+Currently Nighthawk only outputs metrics at the end of a test run. There are no metrics streamed during a test run. In order to collect all the useful Nighthawk metrics, we plan to instrument Nighthawk with the functionality to stream its metrics.
+
+
+## Statistics in BenchMarkClient
+Per worker Statistics are defined in [benchmark_client_impl.h](https://github.com/envoyproxy/nighthawk/blob/master/source/client/benchmark_client_impl.h)
+
+Name | Type | Description
+-----| ----- | ----------------
+total_req_sent | Counter | Total number of requests sent from Nighthawk
+http_xxx | Counter | Total number of response with code xxx
+stream_resets | Counter | Total number of sream reset
+pool_overflow | Counter | Total number of times connection pool overflowed
+pool_connection_failure | Counter | Total number of times pool connection failed
+latency_on_success_req | Histogram | Latency (in Microseconds) histogram of successful request with code 2xx
+latency_on_error_req | Histogram | Latency (in Microseconds) histogram of error request with code other than 2xx
+
+## Envoy Metrics Model
+Since Nighthawk is built on top of [Envoy](https://github.com/envoyproxy/envoy) and similar metrics have been exported from Envoy, it is natural to follow the example in Envoy. Furthermore *Envoy typed metrics are already being used in Nighthawk* ([example](https://github.com/envoyproxy/nighthawk/blob/master/source/client/benchmark_client_impl.h#L33-L46)).
+
+
+Envoy has 3 types of metrics
+- Counters: Unsigned integers (can only increase) represent how many times an event happens, e.g. total number of requests.
+- Gauges: Unsigned integers (can increase or decrease), e.g. number of active connections.
+- Histograms: Unsigned integers that will yield summarized percentile values. E.g. latency distributions.
+
+
+In Envoy, the stat [Store](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/store.h#L29) is a singleton and provides a simple interface by which the rest of the code can obtain handles to scopes, counters, gauges, and histograms. Envoy counters and gauges are periodically (configured at ~5 sec interval) flushed to the sinks. Note that currently histogram values are sent directly to the sinks. A stat [Sink](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/include/envoy/stats/sink.h#L48) is an interface that takes generic stat data and translates it into a backend-specific wire format. Currently Envoy supports the TCP and UDP [statsd](https://github.com/b/statsd_spec) protocol (implemented in [statsd.h](https://github.com/envoyproxy/envoy/blob/master/source/extensions/stat_sinks/common/statsd/statsd.h)). Users can create their own Sink subclass to translate Envoy metrics into backend-specific format.
+
+Envoy metrics can be defined using a macro, e.g.
+```
+#define ALL_CLUSTER_STATS(COUNTER, GAUGE, HISTOGRAM)
+  COUNTER(upstream_cx_total)
+  GAUGE(upstream_cx_active, NeverImport)
+  HISTOGRAM(upstream_cx_length, Milliseconds)
+struct ClusterStats {
+  ALL_CLUSTER_STATS(GENERATE_COUNTER_STRUCT, GENERATE_GAUGE_STRUCT, GENERATE_HISTOGRAM_STRUCT)
+};
+```
+
+## Envoy Metrics Limitation
+Currently Envoy metrics don't support key-value map. As a result, for metrics to be broken down by certain dimensions, we need to define a separate metric for each dimension. For example, currently Nighthawk defines separate [counters](https://github.com/envoyproxy/nighthawk/blob/master/source/client/benchmark_client_impl.h#L35-L40) to monitor the number of responses with corresponding response code.
+
+Due to this limitation, for the latency metric we define 2 metrics: *latency_on_success_req* (to capture latency of successful requests with 2xx) and *latency_on_error_req* (to capture latency of error requests with code other than 2xx).
+
+
+## Envoy Metrics Flush
+Envoy uses a flush timer to periodically flush metrics into stat sinks ([here](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/source/server/server.cc#L479-L480)) at a configured interval (default to 5 sec). For every metric flush, Envoy will call the function [flushMetricsToSinks()](https://github.com/envoyproxy/envoy/blob/74530c92cfa3682b49b540fddf2aba40ac10c68e/source/server/server.cc#L175) to create a metric snapshot from Envoy stat store and flush the snapshot to all sinks through `sink->flush(snapshot)`.
+
+
+## Metrics Export in Nighthawk
+Currently a single Nighthawk can run with multiple workers. In the future, Nighthawk will be extended to be able to run multiple instances together. Since each Nighthawk worker sends requests independently, we decide to export per worker level metrics since it provides several advantages over global level metrics (aggregated across all workers). Notice that *there are existing Nighthawk metrics already defined at per worker level* ([example](https://github.com/envoyproxy/nighthawk/blob/bc72a52efdc489beaa0844b34f136e03394bd355/source/client/benchmark_client_impl.cc#L61)).
+- Per worker level metrics provide information about the performance of each worker which will be hidden by the global level metrics.
+- Keep the workers independent which makes it easier/efficient to scale up to multiple Nighthawks with large numbers of workers.
+
+Metrics can be defined at per worker level using [Scope](https://github.com/envoyproxy/envoy/blob/e9c2c8c4a0141c9634316e8283f98f412d0dd207/include/envoy/stats/scope.h#L35) ( e.g. `cluster.<worker_id>.total_request_sent`). The dynamic portions of metric (e.g. `worker_id`) can be embedded into the metric name. A [TagSpecifier](https://github.com/envoyproxy/envoy/blob/7a652daf35d7d4a6a6bad5a010fe65947ee4411a/api/envoy/config/metrics/v3/stats.proto#L182) can be specified in the bootstrap configuration, which will transform dynamic portions into tags. When per worker level metrics are exported from Nighthawk, multiple per worker level metrics can be converted into a single metric with a `worker_id` label in the stat Sink if the corresponding backend metric supports key-value map.
+
+## Reference
+- [Nighthawk: architecture and key concepts](https://github.com/envoyproxy/nighthawk/blob/master/docs/root/overview.md)
+- [Envoy Stats System](https://github.com/envoyproxy/envoy/blob/master/source/docs/stats.md)
+- [Envoy Stats blog](https://blog.envoyproxy.io/envoy-stats-b65c7f363342)
diff --git a/source/client/benchmark_client_impl.cc b/source/client/benchmark_client_impl.cc
@@ -58,7 +58,8 @@ BenchmarkClientHttpImpl::BenchmarkClientHttpImpl(
       response_statistic_(std::move(response_statistic)),
       response_header_size_statistic_(std::move(response_header_size_statistic)),
       response_body_size_statistic_(std::move(response_body_size_statistic)), use_h2_(use_h2),
-      benchmark_client_stats_({ALL_BENCHMARK_CLIENT_STATS(POOL_COUNTER(*scope_))}),
+      benchmark_client_stats_(
+          {ALL_BENCHMARK_CLIENT_STATS(POOL_COUNTER(*scope_), POOL_HISTOGRAM(*scope_))}),
       cluster_manager_(cluster_manager), http_tracer_(http_tracer),
       cluster_name_(std::string(cluster_name)), request_generator_(std::move(request_generator)),
       provide_resource_backpressure_(provide_resource_backpressure) {
@@ -123,6 +124,7 @@ bool BenchmarkClientHttpImpl::tryStartRequest(CompletionCallback caller_completi
       *response_body_size_statistic_, request->header(), shouldMeasureLatencies(), content_length,
       generator_, http_tracer_);
   requests_initiated_++;
+  benchmark_client_stats_.total_req_sent_.inc();
   pool_ptr->newStream(*stream_decoder, *stream_decoder);
   return true;
 }
@@ -168,5 +170,14 @@ void BenchmarkClientHttpImpl::onPoolFailure(Envoy::Http::ConnectionPool::PoolFai
   }
 }
 
+void BenchmarkClientHttpImpl::exportLatency(const uint32_t response_code,
+                                            const uint64_t latency_us) {
+  if (response_code > 199 && response_code <= 299) {
+    benchmark_client_stats_.latency_on_success_req_.recordValue(latency_us);
+  } else {
+    benchmark_client_stats_.latency_on_error_req_.recordValue(latency_us);
+  }
+}
+
 } // namespace Client
-} // namespace Nighthawk
+} // namespace Nighthawk
diff --git a/source/client/benchmark_client_impl.h b/source/client/benchmark_client_impl.h
@@ -30,7 +30,7 @@ using namespace std::chrono_literals;
 
 using namespace Envoy; // We need this because of macro expectations.
 
-#define ALL_BENCHMARK_CLIENT_STATS(COUNTER)                                                        \
+#define ALL_BENCHMARK_CLIENT_STATS(COUNTER, HISTOGRAM)                                             \
   COUNTER(stream_resets)                                                                           \
   COUNTER(http_1xx)                                                                                \
   COUNTER(http_2xx)                                                                                \
@@ -39,10 +39,13 @@ using namespace Envoy; // We need this because of macro expectations.
   COUNTER(http_5xx)                                                                                \
   COUNTER(http_xxx)                                                                                \
   COUNTER(pool_overflow)                                                                           \
-  COUNTER(pool_connection_failure)
+  COUNTER(pool_connection_failure)                                                                 \
+  COUNTER(total_req_sent)                                                                          \
+  HISTOGRAM(latency_on_success_req, Microseconds)                                                  \
+  HISTOGRAM(latency_on_error_req, Microseconds)
 
 struct BenchmarkClientStats {
-  ALL_BENCHMARK_CLIENT_STATS(GENERATE_COUNTER_STRUCT)
+  ALL_BENCHMARK_CLIENT_STATS(GENERATE_COUNTER_STRUCT, GENERATE_HISTOGRAM_STRUCT)
 };
 
 class Http1PoolImpl : public Envoy::Http::Http1::ProdConnPoolImpl {
@@ -104,6 +107,7 @@ class BenchmarkClientHttpImpl : public BenchmarkClient,
   // StreamDecoderCompletionCallback
   void onComplete(bool success, const Envoy::Http::ResponseHeaderMap& headers) override;
   void onPoolFailure(Envoy::Http::ConnectionPool::PoolFailureReason reason) override;
+  void exportLatency(const uint32_t response_code, const uint64_t latency_us) override;
 
   // Helpers
   Envoy::Http::ConnectionPool::Instance* pool() {
@@ -142,4 +146,4 @@ class BenchmarkClientHttpImpl : public BenchmarkClient,
 };
 
 } // namespace Client
-} // namespace Nighthawk
+} // namespace Nighthawk
diff --git a/source/client/stream_decoder.cc b/source/client/stream_decoder.cc
@@ -49,6 +49,13 @@ void StreamDecoder::onComplete(bool success) {
   ASSERT(!success || complete_);
   if (success && measure_latencies_) {
     latency_statistic_.addValue((time_source_.monotonicTime() - request_start_).count());
+    // At this point StreamDecoder::decodeHeaders() should have been called.
+    if (stream_info_.response_code_.has_value()) {
+      const uint64_t latency_us = std::chrono::duration_cast<std::chrono::microseconds>(
+                                      time_source_.monotonicTime() - request_start_)
+                                      .count();
+      decoder_completion_callback_.exportLatency(stream_info_.response_code_.value(), latency_us);
+    }
   }
   upstream_timing_.onLastUpstreamRxByteReceived(time_source_);
   response_body_sizes_statistic_.addValue(stream_info_.bytesSent());

diff --git a/source/client/stream_decoder.h b/source/client/stream_decoder.h
@@ -24,6 +24,7 @@ class StreamDecoderCompletionCallback {
   virtual ~StreamDecoderCompletionCallback() = default;
   virtual void onComplete(bool success, const Envoy::Http::ResponseHeaderMap& headers) PURE;
   virtual void onPoolFailure(Envoy::Http::ConnectionPool::PoolFailureReason reason) PURE;
+  virtual void exportLatency(const uint32_t response_code, const uint64_t latency_us) PURE;
 };
 
 // TODO(oschaaf): create a StreamDecoderPool?

diff --git a/test/BUILD b/test/BUILD
@@ -18,9 +18,9 @@ envoy_cc_test(
         "//test/test_common:environment_lib",
         "@envoy//source/common/http:header_map_lib_with_external_headers",
         "@envoy//source/common/network:utility_lib_with_external_headers",
-        "@envoy//source/common/stats:isolated_store_lib_with_external_headers",
         "@envoy//source/exe:process_wide_lib_with_external_headers",
         "@envoy//test/mocks/runtime:runtime_mocks",
+        "@envoy//test/mocks/stats:stats_mocks",
         "@envoy//test/mocks/stream_info:stream_info_mocks",
         "@envoy//test/mocks/thread_local:thread_local_mocks",
         "@envoy//test/mocks/upstream:upstream_mocks",

diff --git a/test/benchmark_http_client_test.cc b/test/benchmark_http_client_test.cc
@@ -3,10 +3,10 @@
 #include "external/envoy/source/common/http/header_map_impl.h"
 #include "external/envoy/source/common/network/utility.h"
 #include "external/envoy/source/common/runtime/runtime_impl.h"
-#include "external/envoy/source/common/stats/isolated_store_impl.h"
 #include "external/envoy/source/exe/process_wide.h"
 #include "external/envoy/test/mocks/common.h"
 #include "external/envoy/test/mocks/runtime/mocks.h"
+#include "external/envoy/test/mocks/stats/mocks.h"
 #include "external/envoy/test/mocks/stream_info/mocks.h"
 #include "external/envoy/test/mocks/thread_local/mocks.h"
 #include "external/envoy/test/mocks/upstream/mocks.h"
@@ -124,7 +124,7 @@ class BenchmarkClientHttpTest : public Test {
 
   void setupBenchmarkClient() {
     client_ = std::make_unique<Client::BenchmarkClientHttpImpl>(
-        *api_, *dispatcher_, store_, std::make_unique<StreamingStatistic>(),
+        *api_, *dispatcher_, mock_store_, std::make_unique<StreamingStatistic>(),
         std::make_unique<StreamingStatistic>(), std::make_unique<StreamingStatistic>(),
         std::make_unique<StreamingStatistic>(), false, cluster_manager_, http_tracer_, "benchmark",
         request_generator_, true);
@@ -143,7 +143,8 @@ class BenchmarkClientHttpTest : public Test {
   }
 
   Envoy::Event::TestRealTimeSystem time_system_;
-  Envoy::Stats::IsolatedStoreImpl store_;
+  // deliverHistogramToSinks() is not implemented in IsolatedStoreImpl so test with a mock store.
+  Envoy::Stats::MockIsolatedStatsStore mock_store_;
   Envoy::Api::ApiPtr api_;
   Envoy::Event::DispatcherPtr dispatcher_;
   Envoy::Runtime::RandomGeneratorImpl generator_;
@@ -166,45 +167,76 @@ TEST_F(BenchmarkClientHttpTest, BasicTestH1200) {
   response_code_ = "200";
   testBasicFunctionality(2, 3, 10);
   EXPECT_EQ(5, getCounter("http_2xx"));
+  EXPECT_EQ(5, getCounter("total_req_sent"));
 }
 
 TEST_F(BenchmarkClientHttpTest, BasicTestH1300) {
   response_code_ = "300";
   testBasicFunctionality(0, 11, 10);
   EXPECT_EQ(10, getCounter("http_3xx"));
+  EXPECT_EQ(10, getCounter("total_req_sent"));
 }
 
 TEST_F(BenchmarkClientHttpTest, BasicTestH1404) {
   response_code_ = "404";
   testBasicFunctionality(0, 1, 10);
   EXPECT_EQ(1, getCounter("http_4xx"));
+  EXPECT_EQ(1, getCounter("total_req_sent"));
 }
 
 TEST_F(BenchmarkClientHttpTest, WeirdStatus) {
   response_code_ = "601";
   testBasicFunctionality(0, 1, 10);
   EXPECT_EQ(1, getCounter("http_xxx"));
+  EXPECT_EQ(1, getCounter("total_req_sent"));
 }
 
 TEST_F(BenchmarkClientHttpTest, EnableLatencyMeasurement) {
   setupBenchmarkClient();
   EXPECT_EQ(false, client_->shouldMeasureLatencies());
+  EXPECT_CALL(mock_store_, deliverHistogramToSinks(_, _)).Times(0);
   testBasicFunctionality(10, 1, 10);
   EXPECT_EQ(0, client_->statistics()["benchmark_http_client.queue_to_connect"]->count());
   EXPECT_EQ(0, client_->statistics()["benchmark_http_client.request_to_response"]->count());
+  EXPECT_EQ(10, getCounter("total_req_sent"));
   client_->setShouldMeasureLatencies(true);
+  EXPECT_CALL(mock_store_, deliverHistogramToSinks(_, _)).Times(10);
   testBasicFunctionality(10, 1, 10);
   EXPECT_EQ(10, client_->statistics()["benchmark_http_client.queue_to_connect"]->count());
   EXPECT_EQ(10, client_->statistics()["benchmark_http_client.request_to_response"]->count());
+  EXPECT_EQ(20, getCounter("total_req_sent"));
+}
+
+TEST_F(BenchmarkClientHttpTest, ExportSuccessLatency) {
+  setupBenchmarkClient();
+  uint64_t latency = 10;
+  EXPECT_CALL(mock_store_, deliverHistogramToSinks(Property(&Envoy::Stats::Metric::name,
+                                                            "benchmark.latency_on_success_req"),
+                                                   latency))
+      .Times(1);
+  EXPECT_CALL(mock_store_,
+              deliverHistogramToSinks(
+                  Property(&Envoy::Stats::Metric::name, "benchmark.latency_on_error_req"), latency))
+      .Times(0);
+  client_->exportLatency(/*response_code=*/200, /*latency_us=*/latency);
+}
+
+TEST_F(BenchmarkClientHttpTest, ExportErrorLatency) {
+  setupBenchmarkClient();
+  uint64_t latency = 10;
+  EXPECT_CALL(mock_store_, deliverHistogramToSinks(Property(&Envoy::Stats::Metric::name,
+                                                            "benchmark.latency_on_success_req"),
+                                                   latency))
+      .Times(0);
+  EXPECT_CALL(mock_store_,
+              deliverHistogramToSinks(
+                  Property(&Envoy::Stats::Metric::name, "benchmark.latency_on_error_req"), latency))
+      .Times(1);
+  client_->exportLatency(/*response_code=*/500, /*latency_us=*/latency);
 }
 
 TEST_F(BenchmarkClientHttpTest, StatusTrackingInOnComplete) {
-  auto store = std::make_unique<Envoy::Stats::IsolatedStoreImpl>();
-  client_ = std::make_unique<Client::BenchmarkClientHttpImpl>(
-      *api_, *dispatcher_, *store, std::make_unique<StreamingStatistic>(),
-      std::make_unique<StreamingStatistic>(), std::make_unique<StreamingStatistic>(),
-      std::make_unique<StreamingStatistic>(), false, cluster_manager_, http_tracer_, "foo",
-      request_generator_, true);
+  setupBenchmarkClient();
   Envoy::Http::ResponseHeaderMapPtr header = Envoy::Http::ResponseHeaderMapImpl::create();
 
   header->setStatus(1);

diff --git a/test/integration/test_integration_basics.py b/test/integration/test_integration_basics.py
@@ -34,6 +34,7 @@ def test_http_h1(http_test_server_fixture):
   assertCounterEqual(counters, "upstream_rq_pending_total", 1)
   assertCounterEqual(counters, "upstream_rq_total", 25)
   assertCounterEqual(counters, "default.total_match_count", 1)
+  assertCounterEqual(counters, "benchmark.total_req_sent", 25)
 
   global_histograms = http_test_server_fixture.getNighthawkGlobalHistogramsbyIdFromJson(parsed_json)
   assertEqual(int(global_histograms["benchmark_http_client.response_body_size"]["count"]), 25)
@@ -47,7 +48,7 @@ def test_http_h1(http_test_server_fixture):
   assertEqual(int(global_histograms["benchmark_http_client.response_body_size"]["raw_pstdev"]), 0)
   assertEqual(int(global_histograms["benchmark_http_client.response_header_size"]["raw_pstdev"]), 0)
 
-  assertEqual(len(counters), 12)
+  assertEqual(len(counters), 13)
 
 
 def mini_stress_test(fixture, args):
@@ -179,7 +180,8 @@ def test_http_h2(http_test_server_fixture):
   assertCounterEqual(counters, "upstream_rq_pending_total", 1)
   assertCounterEqual(counters, "upstream_rq_total", 25)
   assertCounterEqual(counters, "default.total_match_count", 1)
-  assertEqual(len(counters), 12)
+  assertCounterEqual(counters, "benchmark.total_req_sent", 25)
+  assertEqual(len(counters), 13)
 
 
 def test_http_concurrency(http_test_server_fixture):
@@ -224,7 +226,8 @@ def test_https_h1(https_test_server_fixture):
   assertCounterEqual(counters, "ssl.sigalgs.rsa_pss_rsae_sha256", 1)
   assertCounterEqual(counters, "ssl.versions.TLSv1.2", 1)
   assertCounterEqual(counters, "default.total_match_count", 1)
-  assertEqual(len(counters), 17)
+  assertCounterEqual(counters, "benchmark.total_req_sent", 25)
+  assertEqual(len(counters), 18)
 
   server_stats = https_test_server_fixture.getTestServerStatisticsJson()
   assertEqual(
@@ -258,7 +261,8 @@ def test_https_h2(https_test_server_fixture):
   assertCounterEqual(counters, "ssl.sigalgs.rsa_pss_rsae_sha256", 1)
   assertCounterEqual(counters, "ssl.versions.TLSv1.2", 1)
   assertCounterEqual(counters, "default.total_match_count", 1)
-  assertEqual(len(counters), 17)
+  assertCounterEqual(counters, "benchmark.total_req_sent", 25)
+  assertEqual(len(counters), 18)
 
 
 def test_https_h2_multiple_connections(https_test_server_fixture):