
transport: Fix deadlock in client keepalive. #1460

Merged 1 commit into grpc:master on Aug 25, 2017
Conversation

@tsuna (Contributor) commented Aug 23, 2017

When gRPC keepalives are enabled (which isn't the case by default at
this time) and PermitWithoutStream is false (the default), the client
can deadlock when transitioning between having no active stream and
having one active stream. Subsequent attempts to create a new stream
or to close the client will hang on the transport's mutex, while the
keepalive goroutine is waiting indefinitely on a channel while holding
the transport's mutex.

This fixes #1459.
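
For context, client keepalives are enabled with dial options roughly like the following (a minimal sketch, not taken from the report: the address and durations are placeholders, and PermitWithoutStream is left at its false default, which is the configuration that triggers the bug above):

    package main

    import (
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    func main() {
        // PermitWithoutStream defaults to false, so keepalive pings are only
        // sent while at least one stream is active on the connection.
        kp := keepalive.ClientParameters{
            Time:    10 * time.Second, // ping after 10s without activity
            Timeout: 3 * time.Second,  // close the connection if no ack within 3s
        }
        conn, err := grpc.Dial("localhost:50051",
            grpc.WithInsecure(),
            grpc.WithKeepaliveParams(kp))
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()
        // ... create streams / issue RPCs on conn ...
    }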

@MakMukhi (Contributor) commented:

Hey @tsuna, thanks for looking into the deadlock and coming up with a solution.
May I suggest an alternative approach?
The purpose of the awakenKeepalive channel is to tell the operateHeader method whether the keepalive routine is dormant or not. If the channel is writable (i.e. has no data on it), the keepalive routine must be dormant. Thus,

  1. It needs to be woken up by writing on it, and
  2. A keepalive ping must be sent out.

When we initialize the transport, we make this channel non-writable by writing some data on it, so that later, when the keepalive routine realizes it must go dormant, it makes the channel writable by reading from it. Notice that at this point a lock must be acquired, since the condition for going dormant depends on the number of active streams.
This works fine the first time: the first stream comes in, operateHeader kicks the keepalive routine out of dormancy by writing on the channel, and sends a ping. However, once the keepalive routine has read from the channel and woken up, the channel is writable again! This is erroneous, since awakenKeepalive must be writable only when the keepalive routine is dormant.
So, the next time around, when the keepalive routine realizes it must go dormant, it tries to read from the awakenKeepalive channel. But since awakenKeepalive is writable, it doesn't have any data on it and can't be read from. At this point the keepalive routine waits while still holding the lock, causing the deadlock.
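
To make the channel protocol above concrete, here is a self-contained sketch (names and timings are illustrative, not the actual transport code) of a capacity-1 awakenKeepalive channel used as a dormancy flag, without the fix. Running it reproduces the described sequence: after the first wake-up the channel is empty ("writable") even though the loop is awake, so the second attempt to go dormant blocks while holding the mutex and the program dies with Go's "all goroutines are asleep - deadlock!" error.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // Illustrative stand-in for the client transport; only the channel
    // choreography mirrors the description above.
    type transport struct {
        mu              sync.Mutex
        activeStreams   int
        awakenKeepalive chan struct{} // cap 1; empty ("writable") should mean dormant
    }

    func (t *transport) keepalive() {
        for {
            time.Sleep(10 * time.Millisecond) // stand-in for the keepalive timer
            t.mu.Lock()
            if t.activeStreams == 0 {
                fmt.Println("keepalive: going dormant")
                // Make the channel writable by reading from it. The second time
                // around the channel is already empty, so this receive blocks
                // forever while t.mu is held -- the reported deadlock.
                <-t.awakenKeepalive
                t.mu.Unlock()
                <-t.awakenKeepalive // wait to be woken by newStream
                fmt.Println("keepalive: woken up; channel is empty again (the bug)")
                continue
            }
            t.mu.Unlock()
            fmt.Println("keepalive: would send ping")
        }
    }

    func (t *transport) newStream() {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.activeStreams++
        if t.activeStreams == 1 {
            select {
            case t.awakenKeepalive <- struct{}{}:
                fmt.Println("newStream: woke keepalive, would send ping")
            default:
            }
        }
    }

    func (t *transport) closeStream() {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.activeStreams--
    }

    func main() {
        t := &transport{awakenKeepalive: make(chan struct{}, 1)}
        t.awakenKeepalive <- struct{}{} // initialized non-writable (full)
        go t.keepalive()

        time.Sleep(20 * time.Millisecond) // keepalive goes dormant once
        t.newStream()                     // wakes keepalive; channel becomes empty again
        time.Sleep(20 * time.Millisecond)
        t.closeStream()                   // back to zero active streams
        time.Sleep(30 * time.Millisecond) // keepalive blocks on <-awakenKeepalive holding t.mu
        t.newStream()                     // hangs on t.mu: all goroutines asleep
    }

With the fix suggested below, newStream would write on the channel a second time after a successful wake-up, restoring the "non-writable while awake" invariant.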

I suggest we make awakenKeepalive non-writable by writing data on it again in operateHeader.
If this code is executed, that must mean the keepalive routine is waiting here. And since the channel is now writable, we can write on it again after this line, so that awakenKeepalive becomes non-writable while the keepalive routine is non-dormant.

@tsuna (Contributor, Author) commented Aug 24, 2017

Just to be sure I understand your suggestion, you're saying that right after this code:

        if len(t.activeStreams) == 1 {
                select {
                case t.awakenKeepalive <- struct{}{}:
                        t.framer.writePing(false, false, [8]byte{})

we need to add:

                        t.awakenKeepalive <- struct{}{}

If yes, then this appears to work, although it looks weird. It's definitely something that's going to deserve a comment in the code.

I'm not sure why you want to keep holding t.mu when waiting on awakenKeepalive. Holding locks while blocking on channels is a common recipe for deadlocks... unless you force strict lock+channel ordering (basically consider channels like locks and enforce strict ordering). In this particular case, there is no benefit in keeping the lock while waiting on the channel.

Either way, I'm happy as long as this hole is plugged.
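
As an aside, the general hazard mentioned above is easy to reproduce in isolation. A contrived, self-contained example (not grpc-go code): one goroutine blocks on a channel receive while holding a mutex, and the only sender needs that same mutex first, so the program dies with Go's runtime deadlock error.

    package main

    import "sync"

    func main() {
        var mu sync.Mutex
        ch := make(chan struct{})
        started := make(chan struct{})

        go func() {
            // Blocks on a channel receive while holding the lock.
            mu.Lock()
            close(started)
            <-ch
            mu.Unlock()
        }()

        go func() {
            <-started
            // The only sender needs the lock first, so the send never happens.
            mu.Lock()
            ch <- struct{}{}
            mu.Unlock()
        }()

        select {} // fatal error: all goroutines are asleep - deadlock!
    }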

@dfawley requested a review from @MakMukhi on August 24, 2017 at 16:53.
@MakMukhi (Contributor) commented:

@tsuna Yes, you understood that right, and I'm glad that it solves the problem.

Your concern is valid: this code is quite complicated with the locks and channels. However, holding that lock while making awakenKeepalive writable again is necessary. If we released the lock before reading off the channel, the NewStream code might get executed in between; it would try to write on the channel but wouldn't, since the channel isn't writable yet. Later, when the keepalive routine starts executing again and expects to read from awakenKeepalive, it won't be able to, since it missed that opportunity. Holding the lock ensures that if keepalive sees the number of streams to be 0, it can safely go dormant by making awakenKeepalive writable.
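
To spell out the race being avoided, here is a contrived, self-contained sketch (made-up names, not the transport code) of the hypothetical unsafe ordering, where the lock is dropped between the stream-count check and the drain of the channel. The non-blocking send in the NewStream path then finds the channel still full, skips the wake-up, and the wake-up is lost for good:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        var mu sync.Mutex
        awaken := make(chan struct{}, 1)
        awaken <- struct{}{} // non-writable: "keepalive is not dormant"
        activeStreams := 0

        checked := make(chan struct{})
        go func() { // keepalive going dormant -- the UNSAFE way
            mu.Lock()
            if activeStreams == 0 {
                mu.Unlock()    // deliberate bug: lock dropped before the drain
                close(checked) // let "NewStream" run inside the window
                time.Sleep(10 * time.Millisecond)
                <-awaken // drains the stale value
                select {
                case <-awaken: // the real wake-up never arrives
                    fmt.Println("keepalive: woken up")
                case <-time.After(time.Second):
                    fmt.Println("keepalive: wake-up lost, would sleep forever")
                }
                return
            }
            mu.Unlock()
        }()

        <-checked
        mu.Lock() // "NewStream" runs in the window between check and drain
        activeStreams++
        select {
        case awaken <- struct{}{}: // fails: the channel is still full
            fmt.Println("NewStream: woke keepalive")
        default:
            fmt.Println("NewStream: keepalive looked awake, so no wake-up sent")
        }
        mu.Unlock()

        time.Sleep(2 * time.Second)
    }

Doing the check and the drain under one critical section removes that window, which is why the current code holds t.mu across both.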

@tsuna (Contributor, Author) commented Aug 24, 2017

You're right, good catch.

@tsuna (Contributor, Author) commented Aug 24, 2017

I amended my commit, PTAL. I already have a CLA on file with Google.

@MakMukhi (Contributor) commented:

@tsuna Thanks for taking care of this. Looks good.

@MakMukhi merged commit c29d638 into grpc:master on Aug 25, 2017.
@menghanl added the 1.6 label on Aug 28, 2017.
@dfawley modified the milestone to 1.6 Release on Aug 28, 2017.
@menghanl pushed a commit to menghanl/grpc-go that referenced this pull request on Aug 30, 2017.
The lock bot locked this conversation as resolved and limited it to collaborators on Jan 18, 2019.
Merging this pull request closes: gRPC keepalive can deadlock on the client side (#1459).