-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v23/context: fix goroutine leak in WithRootCancel #214
v23/context: fix goroutine leak in WithRootCancel #214
Conversation
case <-rootCtx.Done(): | ||
cancel() | ||
case <-ctx.Done(): | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this line isn't really needed since it's a noop. I left this here to be consistent though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need it for the case that the child context is canceled but the root is not - this will be the common case for servers that forward requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes I mean the cancel()
on line 310. The case <-ctx.Done():
for sure is needed but I believe if the child context is canceled or finished, it'll enter the select case and run cancel(), which I believe would just be a noop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - the only change I think that's necessary is to deflake the test.
I'm curious as to how you came to notice this? Is it for a specific configuration that was new to you or is it a 'day-zero'? It's undoubtedly a bug, so sorry for that!
buf := make([]byte, 2<<20) | ||
buf = buf[:runtime.Stack(buf, true)] | ||
count := 0 | ||
for _, g := range strings.Split(string(buf), "\n\n") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is likely going to be flaky, at least in terms of getting to 0. You may want to just check that some, say a third, of the goroutines have exited since there's no way any of them would exit without your change for this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point. This is indeed flaky after running go test -count=1000 -run TestRootCancelGoroutineLeak
.
There were a few options I thought of:
- Do what you suggested, which is check some percentage of them exited.
- Make the test sleep for say 5-10ms after canceling all of the contexts and then check.
- Adjust the context API to expose whether the context's goroutine has ended via some internal context value check (e.g. similar to checking whether the context is done).
- Move the test to be an internal test (i.e. white box test) and adjust the test to wait for all of the contexts' goroutines to have finished via some internal cancel context value check (also similar to checking whether the context is done).
Approach 1 has been a little flaky in terms of figuring out the threshold.
Approach 2 seems to work. From what I've tested, it works for 10000 runs of the tests, but there's still always the possibility that it doesn't quite work depending on how go schedules the goroutines. It's also not great to have a test sleep and arbitrarily take up unnecessary time. This approach could also be done in combination with approach 1 but I prefer checking that all goroutines have really exited.
Approach 3 exposes unnecessary internals of the context package for the purposes of a test which isn't great imo.
Approach 4 requires some finagling to propagate the goroutine's changed value to the cancel context which violates some boundaries (or requires some more heavy handed adjustments).
I've implemented approach 2 since it's pretty easy to do and the least intrusive to the general codebase. Let me know your thoughts and happy to switch it around as needed. The following has run without any problems from my testing so far though.
go test -count=10000 -run TestRootCancelGoroutineLeak$
case <-rootCtx.Done(): | ||
cancel() | ||
case <-ctx.Done(): | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need it for the case that the child context is canceled but the root is not - this will be the common case for servers that forward requests.
Whenever a child context is created via `WithRootCancel`, a goroutine spawns to gracefully handle closing the child context whenever the root context gets canceled. However, the current implementation leaks goroutines whenever the child context exits before the root context exits. This pull request looks to fix the problem by exiting the goroutine whenever the child context is done. See `TestRootCancelGoroutineLeak` for the reproduction use case and test that demonstrates the leak. This is also easily visible via `pprof` using the `goroutine` module.
9b74d6a
to
1391030
Compare
No worries! We have prometheus metrics and a colleague noticed that the number of goroutines were increasing monotonically over time indicating there was a leak somewhere in our codebase. I looked into it via |
I’d opt for a combo of 1 & 2, or if that’s still flaky run the test up to n
times (say is 3) and declare victory when one passes! Thanks for the
explanation of how it was discovered. Feel free to merge if you can, if not
let me know and I’ll merge tomorrow sometime. Check with Noah as to which
branch you’re using internally.
Cheers, Cos.
…On Wed, Jul 21, 2021 at 9:01 PM Richard Huang ***@***.***> wrote:
I'm curious as to how you came to notice this? Is it for a specific
configuration that was new to you or is it a 'day-zero'? It's undoubtedly a
bug, so sorry for that!
No worries! We have prometheus metrics and a colleague noticed that the
number of goroutines were increasing monotonically over time indicating
there was a leak somewhere in our codebase. I looked into it via
pprof/goroutine and noticed that it was coming from the v23 contexts.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#214 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACCMUYCMVQKMU23Z5PKKTITTY4KRPANCNFSM5AQBQBSQ>
.
|
I'm not noticing any flakiness using just approach 2 after an additional 100000 runs so I'll go with that for now. Thanks! |
Oh looks like I don't have write access to the repository so I can't merge it :( |
….19 (#216) * v23/context: fix goroutine leak in WithRootCancel (#214) Whenever a child context is created via `WithRootCancel`, a goroutine spawns to gracefully handle closing the child context whenever the root context gets canceled. However, the current implementation leaks goroutines whenever the child context exits before the root context exits. This pull request looks to fix the problem by exiting the goroutine whenever the child context is done. See `TestRootCancelGoroutineLeak` for the reproduction use case and test that demonstrates the leak. This is also easily visible via `pprof` using the `goroutine` module. Co-authored-by: Richard Huang <[email protected]> * v23/context: fix lint complaint (#215) Co-authored-by: Richard Huang <[email protected]> Co-authored-by: Richard Huang <[email protected]>
Whenever a child context is created via
WithRootCancel
, a goroutinespawns to gracefully handle closing the child context whenever the root
context gets canceled.
However, the current implementation leaks goroutines whenever the child
context exits before the root context exits. This pull request looks to
fix the problem by exiting the goroutine whenever the child context is
done.
See
TestRootCancelGoroutineLeak
for the reproduction use case and testthat demonstrates the leak. This is also easily visible via
pprof
using the
goroutine
module.Fix #1419