-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
grpc: Wait until resources finish cleaning up in Stop() and GracefulStop() #6489
Conversation
When Server.Stop() returns it is not guaranteed that the loopyWriter go-routine terminated. This can lead to a panic or a testing.(*common).logDepth() race condition in Go Tests because t.Log is used after or during the testcase terminates. This can happen when: - a GRPC server is started in a Go test, - the GRPC logs are forwarded to t.Log, - loopyWriter.run logs an error after server.Stop() and the Test method returns. [email protected]/internal/leakcheck is unable to detect that the loopyWriter go-routine continues to run after server.Stop() because it waits up to 10s for go-routines to terminate after a test finishes. The loopyWriter returns much faster after Stop() returns. To make server.Stop() wait until loopyWriter terminated: - rename the existing writerDone field, which is only used in tests, to loopyWriterDone, the writerDone channel is closed when the loopyWriter go-routine exited - change http2server.Close to wait until loopyWriterDone is closed
I don't know a good way to add a testcase for it because the mentioned issue with leakcheck. The testcase also detects another running go-routine of the grpc-client:
This looks like analogous to the loopyWriter issue. I only had a brief look, I guess can be fixed by changing |
Thank you for the PR!
Yes, this one is known: #2869. As for the approach here, it unfortunately has the effect of making the close operation sequential. Each transport will be closed one by one, and wait for the previous one to fully shut down before closing the next. This could be noticeable if there are hundreds or thousands of connections. I'm not sure if there are any things that could delay loopy writer from terminating, but if there is anything that could last up to a second (e.g. a slow network write), then that could make Also, this approach only fixes |
@dfawley thanks for your feedback, I'll try to implement the suggested approach. Do you maybe have an idea how to add a testcase for the issue that is compatible with the current test code? |
The only thing I could think of that wouldn't be flaky is a version of But 1. it's OK to not have a test for this behavior, and 2. if there is a test for it, it's fine if it only fails ~10% of the time when the bug is present. |
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
…inated" This reverts commit ad96f3c.
It was not guaranteed that when Stop() and GracefulStop() returned, the loopyWriter go-routines terminated. This can lead to a panic or a testing.(*common).logDepth() race condition in Go Tests because t.Log is used after or during the testcase terminates. This can happen when: - a GRPC server is started in a Go test, - the GRPC logs are forwarded to t.Log, - loopyWriter.run logs an error after server.Stop() and the Test method returns. [email protected]/internal/leakcheck is unable to detect that the loopyWriter go-routine continues to run after server.Stop() because it waits up to 10s for go-routines to terminate after a test finishes. The loopyWriter returns much faster after Stop() returns. To make ensure Stop and GracefulStop only return after the loopyWriter go-routine terminated: - rename http2Server.writerDone to loopyWriterDone, the channel was only used in test and is closed when loopyWriter returns - wait in http2Server.HandleStreams() until loopyWriterDone was closed - wait in Server.Stop() until all connections were removed, this is already done in GracefulStop. To abort GracefulStop() when Stop() is called a local copy of Server.lis and Server.conns was created and both fields were set to nil and then s.csv.Broadcast() was called to wake up an eventually sleeping GracefulStop() execution. This does not work anymore when we need to wait for connections to be removed in Stop() we need to check the length of Server.conns. Instead a new abortGracefulStop atomicBool is introduced, that is set to true to indicate that a GracefulStop() invocation should be aborted when Stop() is run.
I pushed a new version, were I wait in HandleStreams() for the loopyWriter to terminate and wait in Stop() until all connections are removed. This does not work yet though. |
I think I see what's happening. Since |
The |
I'm also wondering why |
This doesn't sound right.
But you changed
This just seems to be an oversight. |
@dfawley that works! Tests currently fail on go 1.18 because atomic.Bool does not exist in that version. |
@dfawley I addressed them. Thanks a lot for you patience and great support and sorry for oversights at the end causing this unnecessary back and forth. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code all LGTM. All comments are minor nits with respect to comments/naming.
For the non inlined functions that are called while holding the server mutex, can you please add the suffix Locked(). Then I will approve and go ahead and merge :). |
@zasweq yes, it's done! Thanks for your review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for the contribution.
It was not guaranteed that when Stop() and GracefulStop() returned, the
loopyWriter go-routines terminated.
This can lead to a panic or a testing.(*common).logDepth() race
condition in Go Tests because t.Log is used after or during the testcase
terminates.
This can happen when:
method returns.
[email protected]/internal/leakcheck is unable to detect that the
loopyWriter go-routine continues to run after server.Stop() because it
waits up to 10s for go-routines to terminate after a test finishes.
The loopyWriter returns much faster after Stop() returns.
To ensure that Stop and GracefulStop only return after the loopyWriter
go-routine terminates:
used in test and is closed when loopyWriter returns
already done in GracefulStop.
To abort GracefulStop() when Stop() is called a local copy of Server.lis and
Server.conns was created and both fields were set to nil and then
s.csv.Broadcast() was called to wake up an eventually sleeping
GracefulStop() execution. This does not work anymore when we need to
wait for connections to be removed in Stop() we need to check the
length of Server.conns.
Instead GracefulStop now waits until the the procedure completed.
GracefulStop now also stops running stream workers. Shared code between Stop and GracefulStop is moved to helper functions.
This fixes: #6485
RELEASE NOTES: