Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected transport closing: too many pings from client #1882

Closed
immesys opened this issue Feb 24, 2018 · 33 comments · Fixed by #1905
Closed

Unexpected transport closing: too many pings from client #1882

immesys opened this issue Feb 24, 2018 · 33 comments · Fixed by #1905
Assignees

Comments

@immesys
Copy link

immesys commented Feb 24, 2018

What version of gRPC are you using?

583a630

What version of Go are you using (go version)?

1.10

What operating system (Linux, Windows, …) and version?

Linux AMD64, Kernel 4.10

What did you do?

When I have the server configured with GZIP compression as so:

gzcp := grpc.NewGZIPCompressor()
grpcServer := grpc.NewServer(grpc.RPCCompressor(gzcp))

Then when serving thousands of concurrent requests a second, clients will occasionally be disconnected with

rpc error: code = Unavailable desc = transport is closing

I see no errors from the server, and the both the client and server are far from overloaded (<10% CPU usage etc). Not all clients are affected at once, it will just be one connection which gets this error.

While trying to debug this, I disabled GZIP compression so I could use more easily look at packet captures. I am unable to reproduce this error once the GZIP compressor is no longer in use.

This issue is mostly to ask what the best way to proceed with diagnosing the problem is, or if there are any reasons why having a compressor would change the behavior of the system (aside from CPU usage which I don't think is a problem).

@immesys
Copy link
Author

immesys commented Feb 25, 2018

Sorry, I can reproduce this even without the GZIP compressor. I still don't know how to debug it though. The only error I get on the client with loglevel set to info is:

pickfirstBalancer: HandleSubConnStateChange: 0xc421bc6160, TRANSIENT_FAILURE

@MakMukhi
Copy link
Contributor

MakMukhi commented Feb 25, 2018 via email

@immesys
Copy link
Author

immesys commented Feb 25, 2018

The client respects those variables, but the server does not, perhaps the etcd client in the server (which uses grpc for its own stuff) is overriding it? Is there a way to set the log level in the code?

EDIT: I see, etcd calls SetLoggerV2, pity that is global, but I can work around that.

@immesys
Copy link
Author

immesys commented Feb 25, 2018

I finally got a new piece of information. The server prints this shortly before the client disconnects:

ERROR: 2018/02/24 19:16:02 transport: Got too many pings from the client, closing the connection.
ERROR: 2018/02/24 19:16:02 transport: Error while handling item. Err: connection error: desc = "transport is closing"

What does "too many pings" mean?

The client prints this with the new verbosity:

INFO: 2018/02/24 19:16:02 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2018/02/24 19:16:02 pickfirstBalancer: HandleSubConnStateChange: 0xc420b7c4b0, TRANSIENT_FAILURE

@immesys immesys changed the title Unexpected transport closing under load with GZIPCompressor Unexpected transport closing: too many pings from client Feb 25, 2018
@immesys
Copy link
Author

immesys commented Feb 25, 2018

Incidentally, neither the server nor the client passes any keep alive parameters, everything is default

@MakMukhi
Copy link
Contributor

MakMukhi commented Feb 25, 2018 via email

@immesys
Copy link
Author

immesys commented Feb 25, 2018

But both sides are built with the same version of grpc, neither specify any keepalive parameters and it only happens under heavy load.

@MakMukhi
Copy link
Contributor

MakMukhi commented Feb 25, 2018

If it's not keepalive then the only thing inside gRPC that sends pings is bdp estimator which will be triggered on receiving data. However, every time the server sends data frames it resets its pingStrike counter. To quickly check if it's the bdp estimator or not, try turning it off by setting WithInitialWindowSize and WIthInitialConnectionWindowSize dial options on the client.

@immesys
Copy link
Author

immesys commented Feb 25, 2018

Thanks for the suggestion. I have set both of those options (to 1MB) and will leave it running overnight. It typically takes somewhere between 2 minutes and 20 minutes under load to trigger, so a few hours should be conclusive.

@immesys
Copy link
Author

immesys commented Feb 25, 2018

It's been running without any problem for about 9 hours, so I think that confirms that the BDP estimator is at fault. I know nothing about how pings or BDP estimation works inside grpc, but surely if the pings are sent when data goes from client to the server and the pingStrike counter is only reset when data is sent the other way then if the RPC operations spike in latency (e.g mine go to hundreds of milliseconds) then you can exceed the counter simply because the server isn't yet responding to any RPC calls so isn't doing any resets?

@MakMukhi
Copy link
Contributor

A bdp estimator ping is sent by the client when it receives data from the server. And the server must have reset its counter for that dataFrame. In an old version the server was buggy and was resetting its counter only once for a message but you don't seem to be using that.
Thanks for narrowing it down. I'll take a look at it Monday. I don't suppose it would be easy for you to create a repro of it but if you find a way can you please upload it.

@immesys
Copy link
Author

immesys commented Feb 26, 2018

I made a simple reproducer where a program connects to itself and invokes an operation that has roughly the latency I observed with roughly the concurrency I had, but it doesn't seem to trigger it, sorry. Let me know if there is anything else I can do to help

@MakMukhi
Copy link
Contributor

MakMukhi commented Feb 26, 2018 via email

@immesys
Copy link
Author

immesys commented Mar 1, 2018

I didn't succeed in making a standalone reproducer but I can reliably reproduce it in-situ even on completely different hardware. More than willing to help track this down, just let me know how.

@MakMukhi
Copy link
Contributor

MakMukhi commented Mar 1, 2018

Hey, that's great that you can reliably reproduce it in your set up.
Following is what we know:

Here's where the client sends a bdp ping:

t.controlBuf.put(bdpPing)

As you see this is inside of the data handler which is called every time the client gets a dataFrame.
Moreover,client sends ping only if it already got an ack back from a previous ping.

On the server-side whenever a dataFrame is sent the server resets the ping counter:

atomic.StoreUint32(&t.resetPingStrikes, 1)

  • Client sees data and sends ping.
  • Client sees more data doesn't send ping.
  • Client sees more data doesn't send ping.
  • Server sees ping, but the counter was reset.
  • Server sends back ping-ack and now sets the ping strike counter again.
  • Client sees ping-ack and records bdp.
  • The whole process is repeated.

I would start with making sure the faulty pings are always bdp pings. (Assuming you can update your local code):

  1. Make sure there's nothing else that's pinging the server: In the server's handlePing, right after this line

I'd add the following:

if f.Data != bdpPing.data {
  panic("Some other pinger?")
}
  1. Perhaps, print out time whenever server sends data and receives ping.

I'm sorry I have not been able to find out anything obvious yet. But my best guess so far is that something else is pinging the server.
Let's see what you find and we'll go from there. I'd like to stare at these print statements and logs once you have them.

Also, thanks for helping out. :)

@dfawley
Copy link
Member

dfawley commented Mar 1, 2018

@immesys, Random thought along the lines of @MakMukhi's reasoning: you aren't connecting through any kind of proxy, are you?

@immesys
Copy link
Author

immesys commented Mar 1, 2018

@dfawley no, there is no http2 proxy involved. I was initially running in kubernetes, but I have reproduced it on bare metal with no fancy networking at all.

EDIT: actually I lie there is probably some overlay networking still going on, but that would be at layer 3, nothing should be injecting HTTP2 pings

@dfawley
Copy link
Member

dfawley commented Mar 1, 2018

Hmm, sounds like there should be no pings in the system except BDP pings, then. Hopefully you can try the suggestions from @MakMukhi and get back to us. Thanks!

@immesys
Copy link
Author

immesys commented Mar 1, 2018

Ok I reproduced it with the debug statements in place.
Here are the changes I made https://gist.github.com/immesys/be704af11354ba943f35b801abc9caf8
Here are the 2000 lines preceding the event. I included two files, one with all the log output and one filtered to only include log output from the specific http2Server that tripped (there are 10 concurrent connections): https://gist.github.com/immesys/aff2b454b9e0eab843e062e76bfb65c3

Let me know if I can re-run this with different print statements

@immesys
Copy link
Author

immesys commented Mar 1, 2018

@dfawley yeah in the entire log the OTHER PING DATA print statement never triggers, so I think this is just BDP pings.

@immesys
Copy link
Author

immesys commented Mar 1, 2018

Is there a reason the ping strikes are only reset when sending headers in itemHandler? Why not when sending dataFrame?

@immesys
Copy link
Author

immesys commented Mar 1, 2018

To save you reading the whole logs, here is the important sequence:

1519942464739172903 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739261537 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739266862 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739318578 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739452781 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739459106 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739535670 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739659435 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739674955 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739678469 BDPDEBUG[http2Server=0xc423778f00] >maxPingStrikes

Basically because we are writing three data frames and no header frames, we don't reset the ping strikes.

@MakMukhi
Copy link
Contributor

MakMukhi commented Mar 1, 2018 via email

@immesys
Copy link
Author

immesys commented Mar 1, 2018

Should it be before writing it to the wire or just after (the header is just after) ?

@MakMukhi
Copy link
Contributor

MakMukhi commented Mar 1, 2018 via email

@immesys
Copy link
Author

immesys commented Mar 1, 2018

It's been running for a while with no problems. I'll run it overnight and see. Does this completely remove the race condition or just make it harder to hit?

@immesys
Copy link
Author

immesys commented Mar 2, 2018

Still didn't hit this case now after four hours. I stopped the test because I needed the machine for a different test. Is it always the case that every data frame gets an interleaved reply 1:1 ? You can't queue a few frames in the TCP window, clearing the reset strikes after each and then receive the three pings in response all together? I don't know the code, just wondering

@MakMukhi
Copy link
Contributor

MakMukhi commented Mar 2, 2018 via email

@dfawley
Copy link
Member

dfawley commented Mar 2, 2018

Is there any way to figure out why this was happening to @immesys so reproducibly, but none of our tests encountered it?

@immesys
Copy link
Author

immesys commented Mar 2, 2018

Well it only happens after a few minutes under very high load. I think it has to do with the order that goroutines get scheduled. We could probably make a reproducer by putting in some sleeps in the code in the right places

@praserx
Copy link

praserx commented Mar 7, 2018

Similar/same issue here. We've got a one client, one 'relay' server and one server. If we create many concurrent requests (thousands as its mentioned above), we got same error on client:

rpc error: code = Unavailable desc = transport is closing

Thanks to @MakMukhi

... try turning it off by setting WithInitialWindowSize and WIthInitialConnectionWindowSize dial options on the client.

That solve our problem.

@MakMukhi
Copy link
Contributor

MakMukhi commented Mar 7, 2018

@praserx We'd ideally like BDP estimator to be turned on. Is it possible for you to turn it back on and patch #1905 to see if that fixes your problem as well.

@praserx
Copy link

praserx commented Mar 8, 2018

@MakMukhi It seems that it is working perfectly. Thank you for response.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
4 participants