Skip to content

listener: create per network filter chain stats scope#17931

Merged
mattklein123 merged 7 commits intoenvoyproxy:mainfrom
lambdai:scope
Sep 2, 2021
Merged

listener: create per network filter chain stats scope#17931
mattklein123 merged 7 commits intoenvoyproxy:mainfrom
lambdai:scope

Conversation

@lambdai
Copy link
Copy Markdown
Contributor

@lambdai lambdai commented Aug 31, 2021

Commit Message:
Maintain per network filter chain stats scope and avoid leaking stats when the owner filter chain is destroyed.

Additional Description:
The solution was discussed in this thread

Risk Level:
Testing: Integration test to adopt real stats that support query for destroyed stat.
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
Fix #17690

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@mattklein123 mattklein123 self-assigned this Aug 31, 2021
mattklein123
mattklein123 previously approved these changes Aug 31, 2021
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks. LGTM.

@jmarantz can you also take a quick look?

Event::TestTimeSystem::RealTimeBound bound(timeout);
while (findGauge(store, name) != nullptr) {
time_system.advanceTimeWait(std::chrono::milliseconds(10));
if (timeout != std::chrono::milliseconds::zero() && !bound.withinBound()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a much better pattern for waiting for things to happen with stats. See NotifyingAllocatorImpl::waitForCounterExists. It uses a condvar rather than a spin-loop, which in my experience is inherently flaky.

It looks like in the context where you are calling this, you can use NotifyingAllocatorImpl so this should be a straightforward extension.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think NotifyingAllocatorImpl is not available when I use real_stats

    RELEASE_ASSERT(ret != nullptr,
                   "notifyingStatsAllocator() is not created when real_stats is true");

I choose real_stats only because real_stats support stat destroy.

Am I understanding it correctly?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. Did you just see that in the code or did you get that failure during a test?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just took another look. At this level, your impl parallels another one. I have general flakiness concerns about this so it'd be better if we didn't need it.

However in test/integration/server.h you are implementing a virtual method that calls this one. However if you look right above that, in the impl of waitForCounterExists, you can see it using the notifyingStatsAllocator(), which is the better way to go.

It's possible the condvars are hooked up in that for counters and not gauges (I haven't checked) but it should be straightforward to replicate for gauges.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see the failure when I use real stats. Reason here
https://github.com/envoyproxy/envoy/blob/main/test/integration/server.cc#L206-L213

Below is what I call waitForCounterExist b/c waitForGauge{Exist,Destroyed} is not implemented yet.

[2021-09-01 17:28:52.512][12][critical][assert] [./test/integration/server.h:590] assert failure: ret != nullptr. Details: notifyingStatsAllocator() is not created when real_stats is true

Copy link
Copy Markdown
Contributor

@jmarantz jmarantz Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That's just a bool arg to IntegrationTestServerImpl. Switch that bool to false so you can do these tests without risking flakes?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default is actually false, so I guess we need to figure out first why someone decided to set it to true.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's me in this PR :(

I jump to integration test from unit test because the mocked store doesn't support query for destroy. Maybe the non-real stats support query for destroy? Let me give it a try.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, confirmed real stats is not needed.

The notified version of waitForGaugeDestroyed need a bunch notification for Gauge while currently only Counter type support the notification way. Can I (or anyone else) add the notification for Gauge in another PR?

@jmarantz
Copy link
Copy Markdown
Contributor

jmarantz commented Sep 1, 2021

/wait

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, looks good.

Out of curiosity, how long do your new tests run with the 10ms spin-loop?

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai
Copy link
Copy Markdown
Contributor Author

lambdai commented Sep 1, 2021

thanks, looks good.

Out of curiosity, how long do your new tests run with the 10ms spin-loop?

450ms - 500ms Is it a spin loop?

Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be a spin loop if we are using simulated time in the test. I don't remember.

It's not a good pattern in any case. If a single test method is taking 450ms to run that's quite a long time, which is why I think it's good to switch to the condvar.

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai
Copy link
Copy Markdown
Contributor Author

lambdai commented Sep 1, 2021

I think it might be a spin loop if we are using simulated time in the test. I don't remember.

It's not a good pattern in any case. If a single test method is taking 450ms to run that's quite a long time, which is why I think it's good to switch to the condvar.

Yeah, agree. notification/callback is preferred.

I double check the underlying advanceTime is Thread::sleep_for through gdb.

We are not in the best shape for sure, but it's not too bad

@lambdai
Copy link
Copy Markdown
Contributor Author

lambdai commented Sep 1, 2021

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17931 (comment) was created by @lambdai.

see: more, trace.

@lambdai
Copy link
Copy Markdown
Contributor Author

lambdai commented Sep 2, 2021

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #17931 (comment) was created by @lambdai.

see: more, trace.

@mattklein123 mattklein123 merged commit c361b05 into envoyproxy:main Sep 2, 2021
@lambdai lambdai deleted the scope branch September 2, 2021 05:07
lambdai added a commit to lambdai/envoy-dai that referenced this pull request Sep 8, 2021
…roxy#17931)"

This reverts commit c361b05.

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
mattklein123 pushed a commit that referenced this pull request Sep 8, 2021
…" (#18017)

This reverts commit c361b05.

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
tyxia pushed a commit to tyxia/envoy that referenced this pull request Sep 21, 2021
…roxy#17931)" (envoyproxy#18017)

This reverts commit c361b05.

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Seeing stats for http connection manager that has been removed from the filter chain

3 participants