
transport: Fix deadlock in client keepalive. #1460

Merged 1 commit into grpc:master on Aug 25, 2017
Conversation

@tsuna (Contributor) commented Aug 23, 2017

When gRPC keepalives are enabled (which isn't the case by default at
this time) and PermitWithoutStream is false (the default), the client
can deadlock when transitioning between having no active stream and
having one active stream. Subsequent attempts to create a new stream
or to close the client will hang on the transport's mutex, while the
keepalive goroutine is waiting indefinitely on a channel while holding
the transport's mutex.

This fixes #1459.
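
For context, client keepalives are enabled with dial options roughly like the following (a minimal sketch, not taken from the report: the address and durations are placeholders, and PermitWithoutStream is left at its false default, which is the configuration that triggers the bug above):

    package main

    import (
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    func main() {
        // PermitWithoutStream defaults to false, so keepalive pings are only
        // sent while at least one stream is active on the connection.
        kp := keepalive.ClientParameters{
            Time:    10 * time.Second, // ping after 10s without activity
            Timeout: 3 * time.Second,  // close the connection if no ack within 3s
        }
        conn, err := grpc.Dial("localhost:50051",
            grpc.WithInsecure(),
            grpc.WithKeepaliveParams(kp))
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()
        // ... create streams / issue RPCs on conn ...
    }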

@MakMukhi (Contributor) commented:

Hey @tsuna, thanks for looking into the deadlock and coming up with a solution.
May I suggest an alternative approach?
The purpose of the awakenKeepalive channel is to tell the operateHeader method whether the keepalive routine is dormant or not. If the channel is writable (i.e. has no data on it), the keepalive routine must be dormant. Thus,

  1. It needs to be woken up by writing on it, and
  2. A keepalive ping must be sent out.

When we initialize the transport, we make this channel non-writable by writing some data on it, so that later, when the keepalive routine realizes it must go dormant, it makes the channel writable by reading from it. Notice that at this point a lock must be acquired, since the condition for going dormant depends on the number of active streams.
This works fine the first time: the first stream comes in, operateHeader kicks the keepalive routine out of dormancy by writing on the channel, and sends a ping. However, once the keepalive routine has read from the channel and woken up, the channel is writable again! This is erroneous, since awakenKeepalive must be writable only when the keepalive routine is dormant.
So, the next time around, when the keepalive routine realizes it must go dormant, it tries to read from the awakenKeepalive channel. But since awakenKeepalive is writable, it doesn't have any data on it and can't be read from. At this point the keepalive routine waits while still holding the lock, causing the deadlock.
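
To make the channel protocol above concrete, here is a self-contained sketch (names and timings are illustrative, not the actual transport code) of a capacity-1 awakenKeepalive channel used as a dormancy flag, without the fix. Running it reproduces the described sequence: after the first wake-up the channel is empty ("writable") even though the loop is awake, so the second attempt to go dormant blocks while holding the mutex and the program dies with Go's "all goroutines are asleep - deadlock!" error.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // Illustrative stand-in for the client transport; only the channel
    // choreography mirrors the description above.
    type transport struct {
        mu              sync.Mutex
        activeStreams   int
        awakenKeepalive chan struct{} // cap 1; empty ("writable") should mean dormant
    }

    func (t *transport) keepalive() {
        for {
            time.Sleep(10 * time.Millisecond) // stand-in for the keepalive timer
            t.mu.Lock()
            if t.activeStreams == 0 {
                fmt.Println("keepalive: going dormant")
                // Make the channel writable by reading from it. The second time
                // around the channel is already empty, so this receive blocks
                // forever while t.mu is held -- the reported deadlock.
                <-t.awakenKeepalive
                t.mu.Unlock()
                <-t.awakenKeepalive // wait to be woken by newStream
                fmt.Println("keepalive: woken up; channel is empty again (the bug)")
                continue
            }
            t.mu.Unlock()
            fmt.Println("keepalive: would send ping")
        }
    }

    func (t *transport) newStream() {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.activeStreams++
        if t.activeStreams == 1 {
            select {
            case t.awakenKeepalive <- struct{}{}:
                fmt.Println("newStream: woke keepalive, would send ping")
            default:
            }
        }
    }

    func (t *transport) closeStream() {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.activeStreams--
    }

    func main() {
        t := &transport{awakenKeepalive: make(chan struct{}, 1)}
        t.awakenKeepalive <- struct{}{} // initialized non-writable (full)
        go t.keepalive()

        time.Sleep(20 * time.Millisecond) // keepalive goes dormant once
        t.newStream()                     // wakes keepalive; channel becomes empty again
        time.Sleep(20 * time.Millisecond)
        t.closeStream()                   // back to zero active streams
        time.Sleep(30 * time.Millisecond) // keepalive blocks on <-awakenKeepalive holding t.mu
        t.newStream()                     // hangs on t.mu: all goroutines asleep
    }

With the fix suggested below, newStream would write on the channel a second time after a successful wake-up, restoring the "non-writable while awake" invariant.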

I suggest we make awakenKeepalive non-writable by writing data on it again in operateHeader.
If this code is executed, that must mean the keepalive routine is waiting here. And since the channel is now writable, we can write on it again after this line, so that awakenKeepalive becomes non-writable while the keepalive routine is non-dormant.

@tsuna (Contributor, Author) commented Aug 24, 2017

Just to be sure I understand your suggestion, you're saying that right after this code:

        if len(t.activeStreams) == 1 {
                select {
                case t.awakenKeepalive <- struct{}{}:
                        t.framer.writePing(false, false, [8]byte{})

we need to add:

                        t.awakenKeepalive <- struct{}{}

If yes, then this appears to work, although it looks weird. It's definitely something that's going to deserve a comment in the code.

I'm not sure why you want to keep holding t.mu when waiting on awakenKeepalive. Holding locks while blocking on channels is a common recipe for deadlocks... unless you force strict lock+channel ordering (basically consider channels like locks and enforce strict ordering). In this particular case, there is no benefit in keeping the lock while waiting on the channel.

Either way, I'm happy as long as this hole is plugged.
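
As an aside, the general hazard mentioned above is easy to reproduce in isolation. A contrived, self-contained example (not grpc-go code): one goroutine blocks on a channel receive while holding a mutex, and the only sender needs that same mutex first, so the program dies with Go's runtime deadlock error.

    package main

    import "sync"

    func main() {
        var mu sync.Mutex
        ch := make(chan struct{})
        started := make(chan struct{})

        go func() {
            // Blocks on a channel receive while holding the lock.
            mu.Lock()
            close(started)
            <-ch
            mu.Unlock()
        }()

        go func() {
            <-started
            // The only sender needs the lock first, so the send never happens.
            mu.Lock()
            ch <- struct{}{}
            mu.Unlock()
        }()

        select {} // fatal error: all goroutines are asleep - deadlock!
    }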

@dfawley requested a review from @MakMukhi on August 24, 2017 at 16:53.
@MakMukhi (Contributor) commented:

@tsuna Yes, you understood that right, and I'm glad that it solves the problem.

Your concern is valid: this code is quite complicated with the locks and channels. However, holding that lock while making awakenKeepalive writable again is necessary. If we released the lock before reading off the channel, the NewStream code might get executed in between; it would try to write on the channel but wouldn't, since the channel isn't writable yet. Later, when the keepalive routine starts executing again and expects to read from awakenKeepalive, it won't be able to, since it missed that opportunity. Holding the lock ensures that if keepalive sees the number of streams to be 0, it can safely go dormant by making awakenKeepalive writable.
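
To spell out the race being avoided, here is a contrived, self-contained sketch (made-up names, not the transport code) of the hypothetical unsafe ordering, where the lock is dropped between the stream-count check and the drain of the channel. The non-blocking send in the NewStream path then finds the channel still full, skips the wake-up, and the wake-up is lost for good:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        var mu sync.Mutex
        awaken := make(chan struct{}, 1)
        awaken <- struct{}{} // non-writable: "keepalive is not dormant"
        activeStreams := 0

        checked := make(chan struct{})
        go func() { // keepalive going dormant -- the UNSAFE way
            mu.Lock()
            if activeStreams == 0 {
                mu.Unlock()    // deliberate bug: lock dropped before the drain
                close(checked) // let "NewStream" run inside the window
                time.Sleep(10 * time.Millisecond)
                <-awaken // drains the stale value
                select {
                case <-awaken: // the real wake-up never arrives
                    fmt.Println("keepalive: woken up")
                case <-time.After(time.Second):
                    fmt.Println("keepalive: wake-up lost, would sleep forever")
                }
                return
            }
            mu.Unlock()
        }()

        <-checked
        mu.Lock() // "NewStream" runs in the window between check and drain
        activeStreams++
        select {
        case awaken <- struct{}{}: // fails: the channel is still full
            fmt.Println("NewStream: woke keepalive")
        default:
            fmt.Println("NewStream: keepalive looked awake, so no wake-up sent")
        }
        mu.Unlock()

        time.Sleep(2 * time.Second)
    }

Doing the check and the drain under one critical section removes that window, which is why the current code holds t.mu across both.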

@tsuna (Contributor, Author) commented Aug 24, 2017

You're right, good catch.

@tsuna (Contributor, Author) commented Aug 24, 2017

I amended my commit, PTAL. I already have a CLA on file with Google.

@MakMukhi (Contributor) commented:

@tsuna Thanks for taking care of this. Looks good.

@MakMukhi merged commit c29d638 into grpc:master on Aug 25, 2017.
@menghanl added the 1.6 label on Aug 28, 2017.
@dfawley modified the milestone to 1.6 Release on Aug 28, 2017.
@menghanl pushed a commit to menghanl/grpc-go that referenced this pull request on Aug 30, 2017.
The lock bot locked this conversation as resolved and limited it to collaborators on Jan 18, 2019.
Merging this pull request closes: gRPC keepalive can deadlock on the client side (#1459).