Transport close after keepalive failure blocks on unresponsive reader #8425

@jgold2-stripe

What version of gRPC are you using?

v1.71.2

What version of Go are you using (go version)?

go version go1.24.4 darwin/arm64

What operating system (Linux, Windows, …) and version?

macOS

What did you do?

I have a streaming RPC open on a connection with client keepalives enabled. When the network goes down, the keepalive failure is detected as expected, but the transport then takes much longer than necessary to close.

What did you expect to see?

I assumed that the transport would close almost immediately after the keepalive fails.

What did you see instead?

The transport does close, but not until the underlying network connection fails its current call to Read.

Just as some extra color, I think the issue stems from the following:

  • We detect the keepalive failure.
  • Before returning with this error, we call Close.
  • The close proceeds quickly (I verified with logging) until it blocks on readerDone.
  • The readerDone channel is closed when returning from reader(), which itself makes blocking read calls (via the golang.org/x/net/http2 package) on the underlying net.Conn.
  • Although the network is actually down, TCP has not detected this yet, so we wait a significant amount of time (40s in some cases) for the pending Read to fail and readerDone to close.
  • Once readerDone is closed, the transport completes its close quickly from there.

I noticed that a write deadline is set (though I was surprised to find that the error is ignored -- maybe it should be logged at least?), but not a read deadline. Testing with a local version where I also invoke SetReadDeadline, I can confirm that the transport does close quickly as I'd expect.

Is the absence of a call to SetReadDeadline intentional? I wonder if I'm missing a good reason for it. In our case, since the failure to receive a keepalive ACK necessarily indicates that the connection to the server is dead, we absolutely expect any read to time out, so would prefer not to wait unnecessarily for it. I can propose code changes here, but want to step back and see if I am misunderstanding what should be happening.

Labels

Area: Transport (Includes HTTP/2 client/server and HTTP server handler transports and advanced transport features.)
Status: Help Wanted
Type: Bug
