-
Notifications
You must be signed in to change notification settings - Fork 94
Echo back and track origin request-receipt timing deltas #477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 15 commits
c226497
c6697de
28e0d34
94ad984
933c589
a1691ec
740d0ca
1afe2f1
ae0c08e
3487e19
1d1f03a
a9d99eb
0aa8630
db01e40
5e48513
d9569cb
cf1170c
421feb8
af148ea
e813e75
505495c
5405dc9
d858b55
3cf32b8
35aee47
027c00f
70502e1
7774e2d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,6 +19,19 @@ void StreamDecoder::decodeHeaders(Envoy::Http::ResponseHeaderMapPtr&& headers, b | |
| response_header_sizes_statistic_.addValue(response_headers_->byteSize()); | ||
| const uint64_t response_code = Envoy::Http::Utility::getResponseStatus(*response_headers_); | ||
| stream_info_.response_code_ = static_cast<uint32_t>(response_code); | ||
| const auto timing_header_name = Envoy::Http::LowerCaseString("x-nh-do-not-use-origin-timings"); | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note to reviewers: the odd name here is to scare people away from using this. My plan is to make a new option for nighthawk_client to allow configuration of this header name.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. my concern here is i think we are introducing two different header namespaces now. See https://github.com/envoyproxy/nighthawk/blob/master/source/server/well_known_headers.h, where we have If this is the first introduction of If not, can we create an issue to decide on one of those two, and at least make sure that one contains all of the headers, even if we support the other one for backwards compatibility.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. separately, can you explain why having a configurable header name here is beneficial? just not sure why we wouldn't just dictate this
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. re: separately, can you explain why having a configurable header name here is beneficial? just not sure why we wouldn't just dictate this Well, in my opinion, there's certainly something to say for dictating the names: less code, less configuration. Here are the thoughts I had that made me lean towards allow the end user control of the header names involved. I think this mostly applies to both the client and test-server aspects of it:
Considering the above, let me know what you think;
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This makes sense. Thanks for the explanation. |
||
| const auto* timing_header = response_headers_->get(timing_header_name); | ||
|
oschaaf marked this conversation as resolved.
Outdated
|
||
| if (timing_header != nullptr) { | ||
| auto timing_value = timing_header->value().getStringView(); | ||
| int64_t origin_delta; | ||
| if (absl::SimpleAtoi(timing_value, &origin_delta)) { | ||
| origin_latency_statistic_.addValue(origin_delta); | ||
| } else { | ||
| // TODO(XXX): Can we make sure we avoid high frequency logging for this somehow? | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Punted to #484, which has a candidate PoC solution. I remember there are other spots where this would be good to have in the existing code base as well. |
||
| ENVOY_LOG(warn, "Origin delta {} could not be interpreted as an integer.", timing_value); | ||
| } | ||
| } | ||
|
|
||
| if (complete_) { | ||
| onComplete(true); | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,6 +8,7 @@ | |
| #include "server/well_known_headers.h" | ||
|
|
||
| #include "absl/strings/numbers.h" | ||
| #include "absl/strings/str_cat.h" | ||
|
|
||
| namespace Nighthawk { | ||
| namespace Server { | ||
|
|
@@ -16,6 +17,22 @@ HttpTestServerDecoderFilterConfig::HttpTestServerDecoderFilterConfig( | |
| nighthawk::server::ResponseOptions proto_config) | ||
| : server_config_(std::move(proto_config)) {} | ||
|
|
||
| uint64_t HttpTestServerDecoderFilterConfig::ThreadSafeMontonicTimeStopwatch::getElapsedNsAndReset( | ||
| Envoy::TimeSource& time_source) { | ||
| Envoy::Thread::LockGuard guard(lock_); | ||
| // Note that we obtain monotonic time under lock, to ensure that start_ will be updated | ||
| // monotonically. | ||
| const Envoy::MonotonicTime new_time = time_source.monotonicTime(); | ||
| const uint64_t elapsed = start_ == Envoy::MonotonicTime::min() ? 0 : (new_time - start_).count(); | ||
|
oschaaf marked this conversation as resolved.
Outdated
|
||
| start_ = new_time; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Warning: Attempting to reason about concurrency... Do you see any weird effects running with concurrency > 1 on the test server? I was imagining a situation with 2 test server threads operating almost in lockstep, but 1ns apart. Suppose the requests come in evenly every 1ms. If the test server got stalled for a few milliseconds and the OS ended up buffering several packets, and then the test server threads started consuming them round robin, wouldn't the timing reports end up alternating 0, 1ns, 999999ns, 1ns, 999999ns, 1ns, ...? This would actually represent the true response timing coming out of the test server, so the code in this PR would be working as intended. Extreme statistics like that could give the impression that there is extreme distortion from the intermediary proxy, but it's entirely an artifact of the Nighthawk test server having two threads and stalling for a few milliseconds. (Also the intermediary might also introduce its own equally extreme irregularities, since it could also be a multithreaded Envoy.) The obvious way to avoid this issue is to use concurrency=1 on the test server. Then we would only be measuring the distortions from intermediaries and the OS. Do you think there's any future reporting that could diagnose this issue? For example, if we also had exactly this PR but with thread-local stopwatches, we would see perfect evenness rather than 1-999999 fluctuation. We would know that the 1-999999 in the cross-thread stats was the fault of our own multithreading, and any timing deviations would be our OS or the intermediary.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, maybe varying between 1ns and 1ms would be considered negligible distortion, especially if the intermediary added multiple milliseconds.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Your reasoning here makes sense to me. The intent of this is to be able to sanity check part of our timings, so if it would catch those extremes where we don't expect them, I think it would be providing value. The way I see it is that this is going to be particularly useful in A/B tests, for example:
Some existing OSS clients can probably easily be tweaked to take these as inputs to their latency reports to figure out their request-release timing precisions. Also, something like the delta between the expected and observed curves could perhaps serve as a means to quantify distortion; Indeed I think the extremes in the scenario you described would be signal in this case, and not noise (looking at you, adaptive load controller :-)). As for tracking thread-local values: I've been thinking about that, and I think it's pretty doable doing that on the server side by adding a new Envoy histogram in the extension. On the client side however, I think it will take more consideration and work, as I think there would be some feature gaps:
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Afterthought: maybe, if we'd have the per worker histograms on the test server side, plus a means to read/sample these histograms on the NH client side and propagate these to the output, that might be a relatively low effort means to achieve this.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Last thought: per-worker tracking may also yield data that's interesting in light of exposing unbalancedness. |
||
| return elapsed; | ||
| } | ||
|
|
||
| uint64_t | ||
| HttpTestServerDecoderFilterConfig::getElapsedNanosSinceLastRequest(Envoy::TimeSource& time_source) { | ||
| return lastRequestStopwatch().getElapsedNsAndReset(time_source); | ||
| } | ||
|
|
||
| HttpTestServerDecoderFilter::HttpTestServerDecoderFilter( | ||
| HttpTestServerDecoderFilterConfigSharedPtr config) | ||
| : config_(std::move(config)) {} | ||
|
|
@@ -32,6 +49,13 @@ void HttpTestServerDecoderFilter::sendReply() { | |
| static_cast<Envoy::Http::Code>(200), response_body, | ||
| [this](Envoy::Http::ResponseHeaderMap& direct_response_headers) { | ||
| Configuration::applyConfigToResponseHeaders(direct_response_headers, base_config_); | ||
| const std::string previous_request_delta_in_response_header = | ||
| base_config_.emit_previous_request_delta_in_response_header(); | ||
| if (!previous_request_delta_in_response_header.empty() && last_request_delta_ns_ > 0) { | ||
| direct_response_headers.appendCopy( | ||
| Envoy::Http::LowerCaseString(previous_request_delta_in_response_header), | ||
| absl::StrCat(last_request_delta_ns_)); | ||
| } | ||
| }, | ||
| absl::nullopt, ""); | ||
| } else { | ||
|
|
@@ -79,6 +103,10 @@ HttpTestServerDecoderFilter::decodeTrailers(Envoy::Http::RequestTrailerMap&) { | |
| void HttpTestServerDecoderFilter::setDecoderFilterCallbacks( | ||
| Envoy::Http::StreamDecoderFilterCallbacks& callbacks) { | ||
| decoder_callbacks_ = &callbacks; | ||
| // TODO(oschaaf): this adds locking in the hot path. Consider moving this into a separate | ||
| // extension, which will also allow tracking multiple points via configuration. | ||
| last_request_delta_ns_ = | ||
|
oschaaf marked this conversation as resolved.
Outdated
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did we (or can we) verify the performance implication of this? How likely is this going to cause significant regressions in performance tests that use the test server? If we can't verify it, we may be better off keeping this behind a feature flag to begin with? Or is it already on the path only when we ask for it by setting
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I couldn't measure a significant difference in manual tests. Having said that, I'd love to move this into its own extension because intuitively I'd say that there's no way that this can have a positive impact. But a separate extension would allow one to easily take this out of the equation altogether, and possibly also allow one to inject it into multiple places in the server side extension pipeline if desired so, emitting these timings from various places.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Considering the unknown performance impact, I think it would be worthwhile to move this to a new extension in this PR. Thank you for looking into it. |
||
| config_->getElapsedNanosSinceLastRequest(callbacks.dispatcher().timeSource()); | ||
| } | ||
|
|
||
| } // namespace Server | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,20 +2,68 @@ | |
|
|
||
| #include <string> | ||
|
|
||
| #include "envoy/common/time.h" | ||
| #include "envoy/server/filter_config.h" | ||
|
|
||
| #include "external/envoy/source/common/common/lock_guard.h" | ||
| #include "external/envoy/source/common/common/thread.h" | ||
|
|
||
| #include "api/server/response_options.pb.h" | ||
|
|
||
| namespace Nighthawk { | ||
| namespace Server { | ||
|
|
||
| // Basically this is left in as a placeholder for further configuration. | ||
| /** | ||
| * Filter configuration container class for the test server extension. | ||
| * Instances of this class will be shared accross instances of HttpTestServerDecoderFilter. | ||
| */ | ||
| class HttpTestServerDecoderFilterConfig { | ||
| public: | ||
| /** | ||
| * Constructs a new HttpTestServerDecoderFilterConfig instance. | ||
| * | ||
| * @param proto_config The proto configuration of the filter. | ||
| */ | ||
| HttpTestServerDecoderFilterConfig(nighthawk::server::ResponseOptions proto_config); | ||
|
|
||
| /** | ||
| * @return const nighthawk::server::ResponseOptions& read-only reference to the proto config | ||
| * object. | ||
| */ | ||
| const nighthawk::server::ResponseOptions& server_config() { return server_config_; } | ||
|
|
||
| /** | ||
| * Gets the number of elapsed nanoseconds since the last call (server wide). | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How useful is this statistic if it is server wide as compared to having it per each session? I likely don't fully understand the use case for this statistic, so this may as well be working as intended.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The way I see it is that the current approach is going to be particularly useful in A/B tests, for example: with and without an intermediary So the use case is sanity checking and exposing a low effort means to dig up a potential root cause of latency measurement divergence. I think per-worker or per-session tracking could also serve as a purpose to further dig into things as that would allow one to get a sense of balancedness across threads and sessions. Also see https://github.com/envoyproxy/nighthawk/pull/477/files#r476803233 for an earlier discussion related to this. |
||
| * Safe to use concurrently. | ||
| * | ||
| * @param time_source Time source that will be used to obain an updated monotonic time sample. | ||
| * @return uint64_t 0 on the first call, else the number of elapsed nanoseconds since the last | ||
| * call. | ||
| */ | ||
| static uint64_t getElapsedNanosSinceLastRequest(Envoy::TimeSource& time_source); | ||
|
|
||
| private: | ||
| /** | ||
| * Utility class for thread safe tracking of elapsed monotonic time. | ||
| */ | ||
| class ThreadSafeMontonicTimeStopwatch { | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this class part of the HttpTestServerDecoderFilterConfig? is there anything the binds them together other than the fact that one uses the other? Or can we (for clarity and reusability) define it as a standalone one?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't want other things to rely on it yet, but if we feel this is good to go for that, then we can move it out. Maybe it should go into a separate file then and have direct test coverage. Let me know what you think.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the idea of having it in its own file with test coverage is a good one and as a bonus we will end up with a simpler interface. |
||
| public: | ||
| ThreadSafeMontonicTimeStopwatch() : start_(Envoy::MonotonicTime::min()) {} | ||
|
oschaaf marked this conversation as resolved.
Outdated
|
||
| /** | ||
| * @param time_source used to obtain a sample of the current monotonic time. | ||
| * @return uint64_t 0 on the first invocation, and the number of elapsed nanoseconds since the | ||
| * last invocation otherwise. | ||
| */ | ||
| uint64_t getElapsedNsAndReset(Envoy::TimeSource& time_source); | ||
|
|
||
| private: | ||
| Envoy::Thread::MutexBasicLockable lock_; | ||
| Envoy::MonotonicTime start_ GUARDED_BY(lock_); | ||
| }; | ||
|
|
||
| static ThreadSafeMontonicTimeStopwatch& lastRequestStopwatch() { | ||
|
oschaaf marked this conversation as resolved.
Outdated
|
||
| MUTABLE_CONSTRUCT_ON_FIRST_USE(ThreadSafeMontonicTimeStopwatch); // NOLINT | ||
| } | ||
| const nighthawk::server::ResponseOptions server_config_; | ||
| }; | ||
|
|
||
|
|
@@ -43,6 +91,7 @@ class HttpTestServerDecoderFilter : public Envoy::Http::StreamDecoderFilter { | |
| bool json_merge_error_{false}; | ||
| std::string error_message_; | ||
| absl::optional<std::string> request_headers_dump_; | ||
| uint64_t last_request_delta_ns_; | ||
| }; | ||
|
|
||
| } // namespace Server | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.