Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v23/context: fix goroutine leak in WithRootCancel #214

Merged

Conversation

richhx
Copy link
Member

@richhx richhx commented Jul 16, 2021

Whenever a child context is created via WithRootCancel, a goroutine
spawns to gracefully handle closing the child context whenever the root
context gets canceled.

However, the current implementation leaks goroutines whenever the child
context exits before the root context exits. This pull request looks to
fix the problem by exiting the goroutine whenever the child context is
done.

See TestRootCancelGoroutineLeak for the reproduction use case and test
that demonstrates the leak. This is also easily visible via pprof
using the goroutine module.

Fix #1419

case <-rootCtx.Done():
cancel()
case <-ctx.Done():
cancel()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this line isn't really needed since it's a noop. I left this here to be consistent though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need it for the case that the child context is canceled but the root is not - this will be the common case for servers that forward requests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I mean the cancel() on line 310. The case <-ctx.Done(): for sure is needed but I believe if the child context is canceled or finished, it'll enter the select case and run cancel(), which I believe would just be a noop.

Copy link
Contributor

@cosnicolaou cosnicolaou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - the only change I think that's necessary is to deflake the test.

I'm curious as to how you came to notice this? Is it for a specific configuration that was new to you or is it a 'day-zero'? It's undoubtedly a bug, so sorry for that!

buf := make([]byte, 2<<20)
buf = buf[:runtime.Stack(buf, true)]
count := 0
for _, g := range strings.Split(string(buf), "\n\n") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is likely going to be flaky, at least in terms of getting to 0. You may want to just check that some, say a third, of the goroutines have exited since there's no way any of them would exit without your change for this test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point. This is indeed flaky after running go test -count=1000 -run TestRootCancelGoroutineLeak.
There were a few options I thought of:

  1. Do what you suggested, which is check some percentage of them exited.
  2. Make the test sleep for say 5-10ms after canceling all of the contexts and then check.
  3. Adjust the context API to expose whether the context's goroutine has ended via some internal context value check (e.g. similar to checking whether the context is done).
  4. Move the test to be an internal test (i.e. white box test) and adjust the test to wait for all of the contexts' goroutines to have finished via some internal cancel context value check (also similar to checking whether the context is done).

Approach 1 has been a little flaky in terms of figuring out the threshold.

Approach 2 seems to work. From what I've tested, it works for 10000 runs of the tests, but there's still always the possibility that it doesn't quite work depending on how go schedules the goroutines. It's also not great to have a test sleep and arbitrarily take up unnecessary time. This approach could also be done in combination with approach 1 but I prefer checking that all goroutines have really exited.

Approach 3 exposes unnecessary internals of the context package for the purposes of a test which isn't great imo.

Approach 4 requires some finagling to propagate the goroutine's changed value to the cancel context which violates some boundaries (or requires some more heavy handed adjustments).

I've implemented approach 2 since it's pretty easy to do and the least intrusive to the general codebase. Let me know your thoughts and happy to switch it around as needed. The following has run without any problems from my testing so far though.

go test -count=10000 -run TestRootCancelGoroutineLeak$ 

v23/context/context_test.go Show resolved Hide resolved
case <-rootCtx.Done():
cancel()
case <-ctx.Done():
cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need it for the case that the child context is canceled but the root is not - this will be the common case for servers that forward requests.

Whenever a child context is created via `WithRootCancel`, a goroutine
spawns to gracefully handle closing the child context whenever the root
context gets canceled.

However, the current implementation leaks goroutines whenever the child
context exits before the root context exits. This pull request looks to
fix the problem by exiting the goroutine whenever the child context is
done.

See `TestRootCancelGoroutineLeak` for the reproduction use case and test
that demonstrates the leak. This is also easily visible via `pprof`
using the `goroutine` module.
@richhx richhx force-pushed the scratch/rhuang/goroutine-leak-fix branch from 9b74d6a to 1391030 Compare July 21, 2021 18:50
@richhx
Copy link
Member Author

richhx commented Jul 21, 2021

I'm curious as to how you came to notice this? Is it for a specific configuration that was new to you or is it a 'day-zero'? It's undoubtedly a bug, so sorry for that!

No worries! We have prometheus metrics and a colleague noticed that the number of goroutines were increasing monotonically over time indicating there was a leak somewhere in our codebase. I looked into it via pprof/goroutine and noticed that it was coming from the v23 contexts.

@cosnicolaou
Copy link
Contributor

cosnicolaou commented Jul 21, 2021 via email

@richhx
Copy link
Member Author

richhx commented Jul 21, 2021

I'm not noticing any flakiness using just approach 2 after an additional 100000 runs so I'll go with that for now. Thanks!

@richhx richhx closed this Jul 21, 2021
@richhx richhx reopened this Jul 21, 2021
@richhx
Copy link
Member Author

richhx commented Jul 21, 2021

Oh looks like I don't have write access to the repository so I can't merge it :(

@cosnicolaou cosnicolaou merged commit b2e0a54 into vanadium:master Jul 22, 2021
cosnicolaou added a commit that referenced this pull request Jul 29, 2021
….19 (#216)

* v23/context: fix goroutine leak in WithRootCancel (#214)

Whenever a child context is created via `WithRootCancel`, a goroutine
spawns to gracefully handle closing the child context whenever the root
context gets canceled.

However, the current implementation leaks goroutines whenever the child
context exits before the root context exits. This pull request looks to
fix the problem by exiting the goroutine whenever the child context is
done.

See `TestRootCancelGoroutineLeak` for the reproduction use case and test
that demonstrates the leak. This is also easily visible via `pprof`
using the `goroutine` module.

Co-authored-by: Richard Huang <[email protected]>

* v23/context: fix lint complaint (#215)

Co-authored-by: Richard Huang <[email protected]>
Co-authored-by: Richard Huang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants