stats: fix req/resp body/hdrs sizes tracking #15908
jmarantz merged 57 commits into envoyproxy:main
Conversation
Using the upstream_host to find the upstream cluster doesn't work from ActiveStream, because it may not be available anymore by the time we look it up. So use ClusterInfo instead. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
I'll follow up with updated unit tests and an integration test.

/wait
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
/wait
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
/wait

Need to add an integration test. /wait
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
jmarantz
left a comment
looks good; just a few nits about auto
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
jmarantz
left a comment
@envoyproxy/senior-maintainers
Btw, I think we could merge this as is, and I'll follow up with an integration test to avoid regressions. Thanks!
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
alyssawilk
left a comment
Looking good! One functional nit I'm sorry I missed in the first pass, but nearly there regardless :-)
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
/wait
This went through because I was using the same payload size for req/resp bodies; I changed that to avoid masking the same issue in the future. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
/wait
We can circle back in the future if there's merit in excluding these values. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
|
@jmarantz any further comments? |
|
oh actually you handed over to me, I'll assume it's all good :-) |
while (true) {
  auto histo = findByName<Stats::ParentHistogramSharedPtr>(store.histograms(), name);
  if (histo) {
    if (histo->cumulativeStatistics().sampleCount()) {
this call is not thread safe and is triggering TSAN errors, please fix ASAP.
https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/78702/logs/189
Hmm, how come TSAN didn't complain with the original diff? I'll send a fix in a bit
Are you sure it's me? I'm seeing the quic integration test failing under TSAN too:
Well, the integration test I touched did become flaky so maybe it is me:
//test/integration:protocol_integration_test FLAKY, failed in 1 out of 6 in 227.4s
Stats over 6 runs: max = 227.4s, min = 214.3s, avg = 219.7s, dev = 5.2s
/build/tmp/_bazel_envoybuild/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/bazel-out/k8-dbg/testlogs/test/integration/protocol_integration_test/shard_5_of_5/test_attempts/attempt_1.log
//test/integration:quic_http_integration_test FAILED in 4 out of 8 in 37.9s
Stats over 8 runs: max = 37.9s, min = 33.8s, avg = 35.5s, dev = 1.3s
Looking into it...
the quic TSAN failure is here: https://gist.github.com/rgs1/1d352d45a78779b14e68cde4ac39e254
Ok now I see mine:
https://gist.github.com/rgs1/bbd811c453484d8887a0448cc8c3d99c
This race is acceptable, so I think we can tell TSAN to ignore it.
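For reference, if the suppression route were taken, a known-benign race can be silenced with a ThreadSanitizer suppressions file. The symbol below is taken from the trace above; the file name is arbitrary:

```
# tsan-suppressions.txt (hypothetical): ignore the known-benign histogram read
race:Envoy::Stats::HistogramStatisticsImpl::sampleCount
```

The file is passed at runtime, e.g. `TSAN_OPTIONS="suppressions=tsan-suppressions.txt"`. The discussion below takes the other route and fixes the race instead.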
The TSAN failure involves waitUntilHistogramHasSamples which was introduced in this PR. That's why I suspect it is related to this.
2021-06-11T18:24:15.8290833Z #0 Envoy::Stats::HistogramStatisticsImpl::sampleCount() const /proc/self/cwd/./source/common/stats/histogram_impl.h:62:50 (protocol_integration_test+0x7c8b879)
2021-06-11T18:24:15.8292040Z #1 Envoy::TestUtility::waitUntilHistogramHasSamples(Envoy::Stats::Store&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, Envoy::Event::TestTimeSystem&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >) /proc/self/cwd/test/test_common/utility.cc:233:41 (protocol_integration_test+0x794eef8)
This was introduced by envoyproxy#15908. We need to read a histogram's samples from the main thread to avoid data races between writes (from the main thread) and reads (from a worker thread). Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Perform the accounting for these stats from UpstreamRequest, where we get access to the correct numbers.

Things fixed:
- values are recorded now (they weren't previously)
- values will be exactly what's sent/received from upstream, since they are recorded in the router after/before other filters
- the destination cluster will be the right one
- reqs/resps with trailers are handled (they weren't previously)
- added an integration test to prevent regressions

Fixes envoyproxy#15879. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>