-
Notifications
You must be signed in to change notification settings - Fork 860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Outbound calls timing out with context deadline exceeded
when TCP connection is been drop
#21346
Comments
@chlowell This is not Cosmos exclusive, Cosmos does not customize the HTTP layer, it is common across all SDKs through azcore. The point of this report is that there is a problem in Go that in order to address it, the core HTTP pipeline needs to have its settings adjusted. |
Thanks @ealsur, you're quite right, I read this too superficially when triaging it |
We're encountering this bug as well causing unavailability of the client for 15-20 minute periods. In our case for Azure Key Vault. |
We're investigating the issue and will update as we have more info. |
While we investigate, you should be able to work around the issue by providing your own HTTP client with the health check enabled. Here's what this would look like for a Key Vault client. // copied from azcore's default transport policy
customTransport := &http.Transport{
Proxy: http.ProxyFromEnvironment,
DialContext: (&net.Dialer{
Timeout: 30 * time.Second,
KeepAlive: 30 * time.Second,
}).DialContext,
ForceAttemptHTTP2: true,
MaxIdleConns: 100,
IdleConnTimeout: 90 * time.Second,
TLSHandshakeTimeout: 10 * time.Second,
ExpectContinueTimeout: 1 * time.Second,
TLSClientConfig: &tls.Config{
MinVersion: tls.VersionTLS12,
Renegotiation: tls.RenegotiateFreelyAsClient,
},
}
if http2Transport, err := http2.ConfigureTransports(customTransport); err == nil {
// if the connection has been idle for 10 seconds, send a ping frame for a health check
http2Transport.ReadIdleTimeout = 10 * time.Second
// if there's no response to the ping within 2 seconds, close the connection
http2Transport.PingTimeout = 2 * time.Second
}
client, err := azkeys.NewClient("<your vault>", cred, &azkeys.ClientOptions{
ClientOptions: azcore.ClientOptions{
Transport: &http.Client{
Transport: customTransport,
},
},
}) |
Note that the workaround might make things worse if running in an environment where ICMP is blocked (e.g. firewall). |
And to be clear, the root cause is a bug in the Go standard library. What we're evaluating here is a workaround until the above has been fixed. |
Why would that be the case? Note: an HTTP/2 ping frame != an ICMP ping. |
I admit this is outside my area of expertise. You have a link (RFC etc) where I can learn more about this? |
Thanks this is helpful. There's a conversation on the PR where concern was raised about the server not supporting HTTP/2 pings. Is that a real possibility, or would a spec compliant HTTP/2 implementation be required to support ping frames? |
I suppose a badly implemented http/2 server could not implement ping, but it should really not be in production hosting an azure service then? :D http2 spec defines 10 frames : https://httpwg.org/specs/rfc7540.html#rfc.section.11.2 I hope servers that claim supporting http/2 support the core 10 frames. There is nothing stating that they are optional |
No harm making this optional. Even if most of Azure endpoint should support this. On my side I was able to quickly validate if a server does support http2 ping using this python script: import socket
import ssl
import h2.connection
import h2.events
import certifi
import sys
SERVER_NAME = "julienstestcosmos.documents.azure.com"
SERVER_PORT = 443
socket.setdefaulttimeout(15)
ctx = ssl.create_default_context(cafile=certifi.where())
ctx.set_alpn_protocols(['h2'])
s = socket.create_connection((SERVER_NAME, SERVER_PORT))
s = ctx.wrap_socket(s, server_hostname=SERVER_NAME)
c = h2.connection.H2Connection()
c.initiate_connection()
c.ping(b'ffffffff')
s.sendall(c.data_to_send())
response_stream_ended = False
while not response_stream_ended:
data = s.recv(65536 * 1024)
if not data:
break
events = c.receive_data(data)
for event in events:
if isinstance(event, h2.events.PingAckReceived):
print(event)
response_stream_ended = True
break
|
@julienstroheker @jim-minter @sgmiller is there a particular time of day you see the timeouts occur more frequently than others? In our case, for Cosmos DB in particular, we always see connections being dropped or going bad around 23:00 UTC. |
If we want to make it configurable, we have a few options.
Conceptually, it seems like this is something we'd always want to enable, and I have a hard time believing that we have Azure hosts that don't support HTTP/2 ping. |
Or perhaps we expose options for |
this is random for us and happening more often on busier region |
@jhendrixMSFT don't have a strong opinion. Maybe you can expose it as optional for now, gain confidence then default it ? I also feel that having a wiki page might help folks to understand to usage of this. The page can also contains clients (endpoints) that we know so far do support HTTP2 ping (cosmos, event grid, arm, etc...) |
I don't think any http server that support http2 would not have support ping frames. It's part of the spec. |
Agreed. If there is a default, I would default on, and imho the best
interface would be to somehow expose all the knobs as options, with
sensible defaults.
…On Sat, Sep 30, 2023 at 11:18 PM Stéphane Erbrech ***@***.***> wrote:
I don't think any http server that support http2 would not have support
ping frames. It's part of the spec.
—
Reply to this email directly, view it on GitHub
<#21346 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJNHLUO43P5JYBJAXVOTYDX5D4JRANCNFSM6AAAAAA3K6WA7E>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
You'll be surprised the number of servers I saw in the past not supporting ping frames lol. But all Azure Endpoints should respect the RFC. This should be defaulted, if there is an issue with it, the teams owning the DP should fix it, not the client/caller. |
Fix will be including in November's release of |
@jhendrixMSFT just want to mention that we are seeing issue by using HTTP/2 ping with eventgrid client. We reverted back our transport to use http/1 for now. Might not be 100% related but the revert mitigated our issues for now... |
@julienstroheker thanks for the update. Do you have any additional details you can share about how this manifests? |
Bug Report
Opening this for visibility in case other folks are having this issue !
TLDR : golang/go#59690
When using HTTP/2, which is defaulted everywhere now, if the keep-alive connection is been dropped somehow, the sdk clients cannot recover causing some timeout
context deadline exceeded
until the connection been closed by the host.When using a HTTP/2 server, CosmosDB for example, the SDK transport should PING the server to identify is the connection is not stale.
I'm able to reproduce this using CosmosDB sample code such as :
Then run
go run main.go
Check for the port been used to talk to cosmos
netstat -anp | grep main
Drop the connection
sudo iptables -I INPUT -p tcp --dport 54420 -j DROP
where54420
is the port returned by netstat`This is what you should see
The connection should recover at some point (based on the host that will kill the keep-alive connection at some point, might take up to 15minutes)
After setting the HTTP/2 transport properly as following :
Then running it with
export GODEBUG=http2debug=2
&go run main.go
I'm able to see the connection been recover by the transport along with the runtime retrier policy :)
At least for cosmos db client, should these setting be enabled by default ?
Tested this on a vanilla Azure VM, but same issue is happening code hosted on AKS.
The text was updated successfully, but these errors were encountered: