listener: create per network filter chain stats scope#17931
listener: create per network filter chain stats scope#17931mattklein123 merged 7 commits intoenvoyproxy:mainfrom
Conversation
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
mattklein123
left a comment
There was a problem hiding this comment.
Nice, thanks. LGTM.
@jmarantz can you also take a quick look?
test/test_common/utility.cc
Outdated
| Event::TestTimeSystem::RealTimeBound bound(timeout); | ||
| while (findGauge(store, name) != nullptr) { | ||
| time_system.advanceTimeWait(std::chrono::milliseconds(10)); | ||
| if (timeout != std::chrono::milliseconds::zero() && !bound.withinBound()) { |
There was a problem hiding this comment.
there's a much better pattern for waiting for things to happen with stats. See NotifyingAllocatorImpl::waitForCounterExists. It uses a condvar rather than a spin-loop, which in my experience is inherently flaky.
It looks like in the context where you are calling this, you can use NotifyingAllocatorImpl so this should be a straightforward extension.
There was a problem hiding this comment.
I think NotifyingAllocatorImpl is not available when I use real_stats
RELEASE_ASSERT(ret != nullptr,
"notifyingStatsAllocator() is not created when real_stats is true");
I choose real_stats only because real_stats support stat destroy.
Am I understanding it correctly?
There was a problem hiding this comment.
No idea. Did you just see that in the code or did you get that failure during a test?
There was a problem hiding this comment.
I just took another look. At this level, your impl parallels another one. I have general flakiness concerns about this so it'd be better if we didn't need it.
However in test/integration/server.h you are implementing a virtual method that calls this one. However if you look right above that, in the impl of waitForCounterExists, you can see it using the notifyingStatsAllocator(), which is the better way to go.
It's possible the condvars are hooked up in that for counters and not gauges (I haven't checked) but it should be straightforward to replicate for gauges.
There was a problem hiding this comment.
i see the failure when I use real stats. Reason here
https://github.com/envoyproxy/envoy/blob/main/test/integration/server.cc#L206-L213
Below is what I call waitForCounterExist b/c waitForGauge{Exist,Destroyed} is not implemented yet.
[2021-09-01 17:28:52.512][12][critical][assert] [./test/integration/server.h:590] assert failure: ret != nullptr. Details: notifyingStatsAllocator() is not created when real_stats is true
There was a problem hiding this comment.
I see. That's just a bool arg to IntegrationTestServerImpl. Switch that bool to false so you can do these tests without risking flakes?
There was a problem hiding this comment.
the default is actually false, so I guess we need to figure out first why someone decided to set it to true.
There was a problem hiding this comment.
It's me in this PR :(
I jump to integration test from unit test because the mocked store doesn't support query for destroy. Maybe the non-real stats support query for destroy? Let me give it a try.
There was a problem hiding this comment.
Yeah, confirmed real stats is not needed.
The notified version of waitForGaugeDestroyed need a bunch notification for Gauge while currently only Counter type support the notification way. Can I (or anyone else) add the notification for Gauge in another PR?
|
/wait |
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
jmarantz
left a comment
There was a problem hiding this comment.
thanks, looks good.
Out of curiosity, how long do your new tests run with the 10ms spin-loop?
450ms - 500ms Is it a spin loop? |
jmarantz
left a comment
There was a problem hiding this comment.
I think it might be a spin loop if we are using simulated time in the test. I don't remember.
It's not a good pattern in any case. If a single test method is taking 450ms to run that's quite a long time, which is why I think it's good to switch to the condvar.
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Yeah, agree. notification/callback is preferred. I double check the underlying advanceTime is Thread::sleep_for through gdb. We are not in the best shape for sure, but it's not too bad |
|
/retest |
|
Retrying Azure Pipelines: |
|
/retest |
|
Retrying Azure Pipelines: |
…roxy#17931)" This reverts commit c361b05. Signed-off-by: Yuchen Dai <silentdai@gmail.com>
…roxy#17931)" (envoyproxy#18017) This reverts commit c361b05. Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Commit Message:
Maintain per network filter chain stats scope and avoid leaking stats when the owner filter chain is destroyed.
Additional Description:
The solution was discussed in this thread
Risk Level:
Testing: Integration test to adopt real stats that support query for destroyed stat.
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
Fix #17690