Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc: Wait until resources finish cleaning up in Stop() and GracefulStop() #6489

Merged
merged 31 commits into from
Oct 31, 2023

Conversation

fho
Copy link
Contributor

@fho fho commented Jul 31, 2023

It was not guaranteed that when Stop() and GracefulStop() returned, the
loopyWriter go-routines terminated.
This can lead to a panic or a testing.(*common).logDepth() race
condition in Go Tests because t.Log is used after or during the testcase
terminates.
This can happen when:

  • a GRPC server is started in a Go test,
  • the GRPC logs are forwarded to t.Log,
  • loopyWriter.run logs an error after server.Stop() and the Test
    method returns.

[email protected]/internal/leakcheck is unable to detect that the
loopyWriter go-routine continues to run after server.Stop() because it
waits up to 10s for go-routines to terminate after a test finishes.
The loopyWriter returns much faster after Stop() returns.

To ensure that Stop and GracefulStop only return after the loopyWriter
go-routine terminates:

  • rename http2Server.writerDone to loopyWriterDone, the channel was only
    used in test and is closed when loopyWriter returns
  • wait in http2Server.HandleStreams() until loopyWriterDone was closed
  • wait in Server.Stop() until all connections were removed, this is
    already done in GracefulStop.
    To abort GracefulStop() when Stop() is called a local copy of Server.lis and
    Server.conns was created and both fields were set to nil and then
    s.csv.Broadcast() was called to wake up an eventually sleeping
    GracefulStop() execution. This does not work anymore when we need to
    wait for connections to be removed in Stop() we need to check the
    length of Server.conns.
    Instead GracefulStop now waits until the the procedure completed.

GracefulStop now also stops running stream workers. Shared code between Stop and GracefulStop is moved to helper functions.

This fixes: #6485

RELEASE NOTES:

  • grpc: Wait until resources finish cleaning up in Stop() and GracefulStop()

When Server.Stop() returns it is not guaranteed that the loopyWriter
go-routine terminated.

This can lead to a panic or a testing.(*common).logDepth() race
condition in Go Tests because t.Log is used after or during the testcase
terminates.
This can happen when:
- a GRPC server is started in a Go test,
- the GRPC logs are forwarded to t.Log,
- loopyWriter.run logs an error after server.Stop() and the Test
  method returns.

[email protected]/internal/leakcheck is unable to detect that the
loopyWriter go-routine continues to run after server.Stop() because it
waits up to 10s for go-routines to terminate after a test finishes.
The loopyWriter returns much faster after Stop() returns.

To make server.Stop() wait until loopyWriter terminated:
- rename the existing writerDone field, which is only used in tests, to
  loopyWriterDone, the writerDone channel is closed when the loopyWriter
  go-routine exited
- change http2server.Close to wait until loopyWriterDone is closed
@fho
Copy link
Contributor Author

fho commented Jul 31, 2023

I don't know a good way to add a testcase for it because the mentioned issue with leakcheck.
To reproduce the issue consistently it would also be needed to inject into loopyWriter.run to wait a bit longer on shutdown.
To test the PR, I modified leakcheck, inserted a sleep in loopyWriter.run and did not use test.s: fho@f2dce22.

The testcase also detects another running go-routine of the grpc-client:

    leakcheck.go:116: Leaked goroutine: goroutine 13 [runnable]:
        runtime.Gosched()
        	/usr/lib/go/src/runtime/proc.go:321 +0x1a
        google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc000318930)
        	/home/fho/git/grpc-go/internal/transport/controlbuf.go:590 +0x34b
        google.golang.org/grpc/internal/transport.newHTTP2Client.func6()
        	/home/fho/git/grpc-go/internal/transport/http2_client.go:451 +0x165
        created by google.golang.org/grpc/internal/transport.newHTTP2Client
        	/home/fho/git/grpc-go/internal/transport/http2_client.go:449 +0x39d2
    leakcheck.go:116: Leaked goroutine: goroutine 20 [runnable]:

This looks like analogous to the loopyWriter issue. I only had a brief look, I guess can be fixed by changing readerDone in a similar manner. That is an issue for another PR.

@fho fho marked this pull request as ready for review July 31, 2023 15:07
@dfawley
Copy link
Member

dfawley commented Jul 31, 2023

Thank you for the PR!

That is an issue for another PR.

Yes, this one is known: #2869.

As for the approach here, it unfortunately has the effect of making the close operation sequential. Each transport will be closed one by one, and wait for the previous one to fully shut down before closing the next. This could be noticeable if there are hundreds or thousands of connections. I'm not sure if there are any things that could delay loopy writer from terminating, but if there is anything that could last up to a second (e.g. a slow network write), then that could make Stop() take minutes.

Also, this approach only fixes Stop() and not GracefulStop(). I think we could fix both with a different approach: e.g. we could make sure HandleStreams doesn't return until the writer has shut down. After HandleStreams returns is the point that we delete the connection from the set of connections. In addition to this, Stop would need the same kind of loop that waits for len(s.conns) to reach 0 like we do in GracefulStop.

@fho fho marked this pull request as draft July 31, 2023 16:11
@fho
Copy link
Contributor Author

fho commented Jul 31, 2023

@dfawley thanks for your feedback, I'll try to implement the suggested approach.

Do you maybe have an idea how to add a testcase for the issue that is compatible with the current test code?

@dfawley
Copy link
Member

dfawley commented Jul 31, 2023

Do you maybe have an idea how to add a testcase for the issue that is compatible with the current test code?

The only thing I could think of that wouldn't be flaky is a version of leakcheck that fails immediately instead of waiting 10 seconds. If you wanted it to be reliable, you'd have to inject some extra code to optionally Sleep as you mention. And that would only help catch the current goroutines, not any added in the future.

But 1. it's OK to not have a test for this behavior, and 2. if there is a test for it, it's fine if it only fails ~10% of the time when the bug is present.

@github-actions
Copy link

github-actions bot commented Aug 7, 2023

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Aug 7, 2023
fho added 2 commits August 9, 2023 15:32
It was not guaranteed that when Stop() and GracefulStop() returned, the
loopyWriter go-routines terminated.
This can lead to a panic or a testing.(*common).logDepth() race
condition in Go Tests because t.Log is used after or during the testcase
terminates.
This can happen when:
- a GRPC server is started in a Go test,
- the GRPC logs are forwarded to t.Log,
- loopyWriter.run logs an error after server.Stop() and the Test
  method returns.

[email protected]/internal/leakcheck is unable to detect that the
loopyWriter go-routine continues to run after server.Stop() because it
waits up to 10s for go-routines to terminate after a test finishes.
The loopyWriter returns much faster after Stop() returns.

To make ensure Stop and GracefulStop only return after the loopyWriter
go-routine terminated:
- rename http2Server.writerDone to loopyWriterDone, the channel was only
  used in test and is closed when loopyWriter returns
- wait in http2Server.HandleStreams() until loopyWriterDone was closed
- wait in Server.Stop() until all connections were removed, this is
  already done in GracefulStop.
  To abort GracefulStop() when Stop() is called a local copy of Server.lis and
  Server.conns was created and both fields were set to nil and then
  s.csv.Broadcast() was called to wake up an eventually sleeping
  GracefulStop() execution. This does not work anymore when we need to
  wait for connections to be removed in Stop() we need to check the
  length of Server.conns.
  Instead a new abortGracefulStop atomicBool is introduced, that is
  set to true to indicate that a GracefulStop() invocation should be
  aborted when Stop() is run.
@fho
Copy link
Contributor Author

fho commented Aug 9, 2023

I pushed a new version, were I wait in HandleStreams() for the loopyWriter to terminate and wait in Stop() until all connections are removed. This does not work yet though.
When running the tests, Stop() is hanging forever when waiting for the connections to be removed (see CI Logs.
Does somebody has an idea what is wrong?

@github-actions github-actions bot removed the stale label Aug 9, 2023
@dfawley
Copy link
Member

dfawley commented Aug 9, 2023

Stop() is hanging forever when waiting for the connections to be removed (see CI Logs.
Does somebody has an idea what is wrong?

I think I see what's happening. HandleStreams is returning, but (*grpc.Server).serveStreams, which calls it, is not returning because the method handlers are still running. Because serveStreams is not returning, even though the connection is fully closed, removeConn is not executed, meaning the grpc.Server (and Stop) still thinks the connection is there.

Since HandleStreams now does not return until it has cleaned up the connection, I wonder if we can remove the waitgroup in serveStreams entirely now? That seems plausible.

@fho
Copy link
Contributor Author

fho commented Aug 10, 2023

Since HandleStreams now does not return until it has cleaned up the connection, I wonder if we can remove the waitgroup in serveStreams entirely now? That seems plausible.

The wg in Server.serveStreams() is used to wait is for a go-routine that runs Server.handleStream() as part of the st.HandleStreams() handle func.
I don't see how it would be ensured without the waitGroup that the Server.handleStream() go-routines return before Stop/GracefulStop would return.

@fho
Copy link
Contributor Author

fho commented Aug 10, 2023

I'm also wondering why s.stopServerWorkers is only called in Stop() but not in GracefulStop.

@dfawley
Copy link
Member

dfawley commented Aug 10, 2023

The wg in Server.serveStreams() is used to wait is for a go-routine that runs Server.handleStream() as part of the st.HandleStreams() handle func.

This doesn't sound right.

st.HandleStreams calls the closure that server.go passes it for each RPC. When an RPC starts, the first thing that does is increment the wg, and when it finishes, it calls wg.Done() on it. This means all the server handlers have to return before wg.Wait() will ever unblock.

But you changed st.HandleStreams so that it blocked until the transport was closed. At that point, it seems fine to dereference the st which is what happens when serveStreams returns (ref). We can't rely on the method handlers all returning when the stream ends (these tests that are failing are blocked in the method handlers, which presumably will be cleaned up later), because it's running code that isn't owned by gRPC at that point.

I'm also wondering why s.stopServerWorkers is only called in Stop() but not in GracefulStop.

This just seems to be an oversight.

@fho
Copy link
Contributor Author

fho commented Aug 11, 2023

@dfawley that works!
I still don't understand the code sufficiently to judge if my change is correct.
Should I keep the abortGracefulStop atomic.Bool or is there a better way to abort the GracefulStop?

Tests currently fail on go 1.18 because atomic.Bool does not exist in that version.
Because Golang 1.21 was released and the min. required go version will therefore be changed to 1.19 which supports it, I kept it.

@fho fho marked this pull request as ready for review August 11, 2023 14:55
@fho
Copy link
Contributor Author

fho commented Oct 27, 2023

@dfawley I addressed them. Thanks a lot for you patience and great support and sorry for oversights at the end causing this unnecessary back and forth.

@zasweq zasweq self-requested a review October 27, 2023 17:47
@zasweq zasweq self-assigned this Oct 27, 2023
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code all LGTM. All comments are minor nits with respect to comments/naming.

server.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
server.go Outdated Show resolved Hide resolved
test/gracefulstop_test.go Outdated Show resolved Hide resolved
test/gracefulstop_test.go Outdated Show resolved Hide resolved
@zasweq zasweq removed their assignment Oct 27, 2023
@zasweq
Copy link
Contributor

zasweq commented Oct 31, 2023

For the non inlined functions that are called while holding the server mutex, can you please add the suffix Locked(). Then I will approve and go ahead and merge :).

@zasweq zasweq assigned fho and unassigned zasweq Oct 31, 2023
@fho
Copy link
Contributor Author

fho commented Oct 31, 2023

@zasweq yes, it's done! Thanks for your review

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for the contribution.

@zasweq zasweq changed the title transport/http2server: wait in Close() until loopyWriter terminated grpc: Wait until resources finish cleaning up in Stop() and GracefulStop() Oct 31, 2023
@zasweq zasweq merged commit 70f1a40 into grpc:master Oct 31, 2023
13 checks passed
arvindbr8 pushed a commit to arvindbr8/grpc-go that referenced this pull request Nov 7, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Behavior Change Behavior changes not categorized as bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

loopyWriter continuing to run after GracefulStop causing race with t.Log
6 participants