Skip to content

stats: fix req/resp body/hdrs sizes tracking#15908

Merged
jmarantz merged 57 commits intoenvoyproxy:mainfrom
rgs1:issue-15879
Jun 11, 2021
Merged

stats: fix req/resp body/hdrs sizes tracking#15908
jmarantz merged 57 commits intoenvoyproxy:mainfrom
rgs1:issue-15879

Conversation

@rgs1
Copy link
Member

@rgs1 rgs1 commented Apr 9, 2021

Perform the accounting for these stats from UpstreamRequest, where we
get access to the correct numbers.

Things fixed:

  • values are recorded now (they weren't previously)
  • values will be exactly what's sent/received from upstream since they are recorded
    in the router after/before other filters
  • the destination cluster will be the right one
  • reqs/resps with trailers are handled (they weren't previously)
  • added an integration test to prevent regressions

Fixes #15879.

Signed-off-by: Raul Gutierrez Segales rgs@pinterest.com

Using the upstream_host to find the upstream cluster doesn't
work from ActiveStream, because it may not be available anymore
by the time we look it up. So use ClusterInfo instead.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1 rgs1 changed the title cluster: fix req/resp body/hdrs sizes tracking stats: fix req/resp body/hdrs sizes tracking Apr 9, 2021
@rgs1
Copy link
Member Author

rgs1 commented Apr 9, 2021

I'll follow-up with updated unit tests and an integration test.

@rgs1
Copy link
Member Author

rgs1 commented Apr 9, 2021

/wait

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Copy link
Member Author

rgs1 commented Apr 9, 2021

/wait

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Copy link
Member Author

rgs1 commented Apr 10, 2021

/wait

@rgs1
Copy link
Member Author

rgs1 commented Apr 11, 2021

Need to an integration test.

/wait

Raul Gutierrez Segales added 2 commits April 12, 2021 18:54
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good; just a few nits about auto

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
jmarantz
jmarantz previously approved these changes Apr 13, 2021
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@envoyproxy/senior-maintainers

@rgs1
Copy link
Member Author

rgs1 commented Apr 13, 2021

Btw, I think we could merge this as is and I'll follow-up with an integration test to avoid regressions. Thanks!

Raul Gutierrez Segales added 2 commits April 13, 2021 13:47
Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! One functional nit I'm sorry I missed in the first pass, but nearly there regardless :-)

Raul Gutierrez Segales added 2 commits June 10, 2021 14:09
@rgs1
Copy link
Member Author

rgs1 commented Jun 10, 2021

/wait

This went through because I was using the same payload size for req/resp
bodies - changed that to avoid the same issue in the future.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@rgs1
Copy link
Member Author

rgs1 commented Jun 10, 2021

/wait

We can circle back in the future if there's merit in excluding
these values.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@alyssawilk
Copy link
Contributor

@jmarantz any further comments?

@alyssawilk
Copy link
Contributor

oh actually you handed over to me, I'll assume it's all good :-)

@jmarantz jmarantz dismissed snowp’s stale review June 11, 2021 12:18

All changes addessed

@jmarantz jmarantz merged commit f3f8df1 into envoyproxy:main Jun 11, 2021
while (true) {
auto histo = findByName<Stats::ParentHistogramSharedPtr>(store.histograms(), name);
if (histo) {
if (histo->cumulativeStatistics().sampleCount()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call is not thread safe and is triggering TSAN errors, please fix ASAP.

https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/78702/logs/189

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how come TSAN didn't complain with the original diff? I'll send a fix in a bit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the integration test I touched did become flaky so maybe it is me:

//test/integration:protocol_integration_test                              FLAKY, failed in 1 out of 6 in 227.4s
  Stats over 6 runs: max = 227.4s, min = 214.3s, avg = 219.7s, dev = 5.2s
  /build/tmp/_bazel_envoybuild/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/bazel-out/k8-dbg/testlogs/test/integration/protocol_integration_test/shard_5_of_5/test_attempts/attempt_1.log
//test/integration:quic_http_integration_test                            FAILED in 4 out of 8 in 37.9s
  Stats over 8 runs: max = 37.9s, min = 33.8s, avg = 35.5s, dev = 1.3s

Looking into it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok now I see mine:

https://gist.github.com/rgs1/bbd811c453484d8887a0448cc8c3d99c

This race is acceptable, so I think we can tell TSAN to ignore it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TSAN failure involves waitUntilHistogramHasSamples which was introduced in this PR. That's why I suspect it is related to this.

2021-06-11T18:24:15.8290833Z #0 Envoy::Stats::HistogramStatisticsImpl::sampleCount() const /proc/self/cwd/./source/common/stats/histogram_impl.h:62:50 (protocol_integration_test+0x7c8b879)
2021-06-11T18:24:15.8292040Z #1 Envoy::TestUtility::waitUntilHistogramHasSamples(Envoy::Stats::Store&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, Envoy::Event::TestTimeSystem&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >) /proc/self/cwd/test/test_common/utility.cc:233:41 (protocol_integration_test+0x794eef8)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible fix: #16946

rgs1 pushed a commit to rgs1/envoy that referenced this pull request Jun 11, 2021
This was introduced by envoyproxy#15908. We need to read a histogram's from
the main thread to avoid data races between writes (from the main
thread) and reads (from a worker thread).

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
mattklein123 pushed a commit that referenced this pull request Jun 14, 2021
This was introduced by #15908. We need to read a histogram's from
the main thread to avoid data races between writes (from the main
thread) and reads (from a worker thread).

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
Perform the accounting for these stats from UpstreamRequest, where we
get access to the correct numbers.

Things fixed:

values are recorded now (they weren't previously)
values will be exactly what's sent/received from upstream since they are recorded
in the router after/before other filters
the destination cluster will be the right one
reqs/resps with trailers are handled (they weren't previously)
added an integration test to prevent regressions
Fixes envoyproxy#15879.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
leyao-daily pushed a commit to leyao-daily/envoy that referenced this pull request Sep 30, 2021
…xy#16946)

This was introduced by envoyproxy#15908. We need to read a histogram's from
the main thread to avoid data races between writes (from the main
thread) and reads (from a worker thread).

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

track_cluster_stats/request_response_sizes not working

6 participants