Unexpected transport closing: too many pings from client #1882
Comments
Sorry, I can reproduce this even without the GZIP compressor. I still don't know how to debug it, though. The only error I get on the client with the log level set to info is:
pickfirstBalancer: HandleSubConnStateChange: 0xc421bc6160, TRANSIENT_FAILURE
|
Turn on transport-level logs by setting these two environment variables:
export GRPC_GO_LOG_VERBOSITY_LEVEL=2
export GRPC_GO_LOG_SEVERITY_LEVEL=info
Maybe the transport on the server or client side sees some error and closes.
|
The client respects those variables, but the server does not. Perhaps the etcd client in the server (which uses gRPC for its own purposes) is overriding them? Is there a way to set the log level in code? EDIT: I see, etcd calls |
I finally got a new piece of information. The server prints this shortly before the client disconnects:
ERROR: 2018/02/24 19:16:02 transport: Got too many pings from the client, closing the connection.
ERROR: 2018/02/24 19:16:02 transport: Error while handling item. Err: connection error: desc = "transport is closing"
What does "too many pings" mean? The client prints this with the new verbosity:
|
Incidentally, neither the server nor the client passes any keepalive parameters; everything is default. |
The client is pinging the server too many times. Perhaps the keepalive settings on the client and the server are not the same.
|
But both sides are built with the same version of gRPC, neither specifies any keepalive parameters, and it only happens under heavy load. |
If it's not keepalive, then the only other thing inside gRPC that sends pings is the BDP estimator, which is triggered on receiving data. However, every time the server sends data frames it resets its pingStrike counter. To quickly check whether it's the BDP estimator or not, try turning it off by setting |
Thanks for the suggestion. I have set both of those options (to 1MB) and will leave it running overnight. It typically takes somewhere between 2 minutes and 20 minutes under load to trigger, so a few hours should be conclusive. |
It's been running without any problem for about 9 hours, so I think that confirms that the BDP estimator is at fault. I know nothing about how pings or BDP estimation work inside gRPC. But if the pings are sent when data goes from the client to the server, and the pingStrike counter is only reset when data is sent the other way, then a spike in RPC latency (e.g., mine go to hundreds of milliseconds) could exceed the counter simply because the server isn't yet responding to any RPC calls and so isn't doing any resets. Is that possible? |
A BDP estimator ping is sent by the client when it receives data from the server, and the server must have reset its counter for that dataFrame. In an old version the server was buggy and reset its counter only once per message, but you don't seem to be using that version. |
I made a simple reproducer where a program connects to itself and invokes an operation that has roughly the latency I observed with roughly the concurrency I had, but it doesn't seem to trigger it, sorry. Let me know if there is anything else I can do to help |
Thanks for trying to reproduce it. We might need to collaborate to debug it
further. I'll look at the code again tomorrow to see if I can find any
obvious mistakes.
|
I didn't succeed in making a standalone reproducer but I can reliably reproduce it in-situ even on completely different hardware. More than willing to help track this down, just let me know how. |
Hey, that's great that you can reliably reproduce it in your set up. Here's where the client sends a bdp ping: grpc-go/transport/http2_client.go Line 840 in f0a1202
As you see this is inside of the data handler which is called every time the client gets a dataFrame. On the server-side whenever a dataFrame is sent the server resets the ping counter: grpc-go/transport/http2_server.go Line 901 in f0a1202
I would start by making sure the faulty pings are always BDP pings (assuming you can update your local code). I'd add the following:
if f.Data != bdpPing.data {
	panic("Some other pinger?")
}
I'm sorry I have not been able to find out anything obvious yet. But my best guess so far is that something else is pinging the server. Also, thanks for helping out. :) |
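The check suggested above works because an HTTP/2 PING frame carries an 8-byte opaque payload, so a sender can tag its own pings with a fixed pattern and the receiver can compare against it. A minimal standalone sketch of that comparison follows. The payload value and the names `bdpPingData` and `isBDPPing` are hypothetical, chosen only for illustration; they are not taken from grpc-go.

```go
package main

import "fmt"

// bdpPingData is a hypothetical 8-byte payload that tags pings sent by a
// bandwidth-delay-product estimator. The value used by grpc-go may differ.
var bdpPingData = [8]byte{1, 2, 3, 4, 5, 6, 7, 8}

// isBDPPing reports whether a received PING payload matches the tag.
// Go arrays are comparable, so == compares all eight bytes.
func isBDPPing(data [8]byte) bool {
	return data == bdpPingData
}

func main() {
	fromEstimator := bdpPingData
	fromElsewhere := [8]byte{} // e.g. a ping carrying a zero payload

	fmt.Println(isBDPPing(fromEstimator))
	fmt.Println(isBDPPing(fromElsewhere))
}
```

Logging the mismatching payload instead of panicking would keep the process alive while still identifying any extra pinger.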
@dfawley no, there is no http2 proxy involved. I was initially running in kubernetes, but I have reproduced it on bare metal with no fancy networking at all. EDIT: actually I lie there is probably some overlay networking still going on, but that would be at layer 3, nothing should be injecting HTTP2 pings |
Hmm, sounds like there should be no pings in the system except BDP pings, then. Hopefully you can try the suggestions from @MakMukhi and get back to us. Thanks! |
Ok I reproduced it with the debug statements in place. Let me know if I can re-run this with different print statements |
@dfawley yeah in the entire log the |
Is there a reason the ping strikes are only reset when sending headers in |
To save you reading the whole logs, here is the important sequence:
1519942464739172903 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739261537 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739266862 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739318578 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739452781 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739459106 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739535670 BDPDEBUG[http2Server=0xc423778f00] writing data1
1519942464739659435 BDPDEBUG[http2Server=0xc423778f00] handlePing
1519942464739674955 BDPDEBUG[http2Server=0xc423778f00] addPingStrikes2
1519942464739678469 BDPDEBUG[http2Server=0xc423778f00] >maxPingStrikes
Basically, because we are writing three data frames and no header frames, we don't reset the ping strikes. |
They are reset while scheduling the dataFrame here:
https://github.com/grpc/grpc-go/blob/master/transport/http2_server.go#L901
But I think you're right: if we moved this to just before writing the dataFrame on the wire, it would work out. Can you try doing that?
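One interleaving consistent with the log above can be replayed in a toy model that compares the two reset points discussed here: setting the reset flag when a data frame is scheduled versus setting it just before the frame is written to the wire. This is a sketch under stated assumptions, not grpc-go code. The names `server`, `schedule`, `flush`, and `run` are hypothetical, `maxPingStrikes` is assumed to be 2 (which matches the connection closing on the third strike in the log), and every ping that arrives without a pending reset is assumed to count as a strike.

```go
package main

import "fmt"

// maxPingStrikes is assumed to be 2: the connection closes on the third strike.
const maxPingStrikes = 2

// server is a toy model of the ping accounting, not grpc-go code.
type server struct {
	resetAtWire bool // false: flag set when data is scheduled; true: when it is written
	resetFlag   bool // set by outgoing data, consumed by the next ping
	strikes     int
	closed      bool
}

// schedule models the write call that queues a data frame.
func (s *server) schedule() {
	if !s.resetAtWire {
		s.resetFlag = true
	}
}

// flush models the writer putting one queued data frame on the wire.
func (s *server) flush() {
	if s.resetAtWire {
		s.resetFlag = true
	}
}

// handlePing models an incoming ping. The model assumes pings arrive faster
// than the minimum allowed interval, so every ping without a pending reset
// counts as a strike.
func (s *server) handlePing() {
	if s.resetFlag {
		s.resetFlag = false
		s.strikes = 0
		return
	}
	s.strikes++
	if s.strikes > maxPingStrikes {
		s.closed = true
	}
}

// run replays one interleaving: three frames are queued, a ping arrives
// before any of them is written, then each written frame triggers one ping.
func run(resetAtWire bool) *server {
	s := &server{resetAtWire: resetAtWire}
	for i := 0; i < 3; i++ {
		s.schedule()
	}
	s.handlePing() // earlier ping; consumes the flag in the schedule-time variant
	for i := 0; i < 3; i++ {
		s.flush()
		s.handlePing() // ping sent by the client in response to this frame
	}
	return s
}

func main() {
	fmt.Println("reset at schedule time: closed =", run(false).closed)
	fmt.Println("reset at wire time:     closed =", run(true).closed)
}
```

In this model the schedule-time variant closes the connection because a single earlier ping consumes the flag for all three queued frames, while the wire-time variant re-arms the flag for each frame that is actually written.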
|
Should it be before writing it to the wire or just after (the header is just after) ? |
I'd put it right before, but it shouldn't matter as much.
|
It's been running for a while with no problems. I'll run it overnight and see. Does this completely remove the race condition or just make it harder to hit? |
Still didn't hit this case now after four hours. I stopped the test because I needed the machine for a different test. Is it always the case that every data frame gets an interleaved reply 1:1 ? You can't queue a few frames in the TCP window, clearing the reset strikes after each and then receive the three pings in response all together? I don't know the code, just wondering |
I suspect the problem was a ping ack being scheduled between the two steps where the write method resets the counter and schedules data. This scheduling of the ping ack sets the counter back.
If this was indeed the case, then moving the reset instruction into itemHandler will eliminate the race.
Multiple data frames get queued up in gRPC itself and, I'm sure, in the OS kernel too.
|
Is there any way to figure out why this was happening to @immesys so reproducibly, but none of our tests encountered it? |
Well it only happens after a few minutes under very high load. I think it has to do with the order that goroutines get scheduled. We could probably make a reproducer by putting in some sleeps in the code in the right places |
Similar/same issue here. We have one client, one 'relay' server, and one server. If we create many concurrent requests (thousands, as mentioned above), we get the same error on the client:
Thanks to @MakMukhi
That solved our problem. |
@MakMukhi It seems to be working perfectly. Thank you for the response. |
What version of gRPC are you using?
583a630
What version of Go are you using (go version)?
1.10
What operating system (Linux, Windows, …) and version?
Linux AMD64, Kernel 4.10
What did you do?
When I have the server configured with GZIP compression as so:
Then when serving thousands of concurrent requests a second, clients will occasionally be disconnected with
I see no errors from the server, and both the client and server are far from overloaded (<10% CPU usage, etc.). Not all clients are affected at once; it will just be one connection that gets this error.
While trying to debug this, I disabled GZIP compression so I could more easily look at packet captures. I am unable to reproduce this error once the GZIP compressor is no longer in use.
This issue is mostly to ask what the best way to proceed with diagnosing the problem is, or if there are any reasons why having a compressor would change the behavior of the system (aside from CPU usage which I don't think is a problem).