corefx (public archive)
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Http2: Fix remote server tests occasionally timing out in CI #39588

Closed
wants to merge 2 commits

Conversation

eiriktsarpalis
Member

Increases the HttpClient timeout from the default 100 seconds to 150 seconds, in order to address functional tests occasionally timing out in CI. Fixes #39545.
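For reference, here is a minimal sketch of what the change amounts to on an affected test client, using only the public HttpClient API (the PR's actual helper plumbing is shown in the review diff below; the class and method names here are illustrative):

```csharp
// Minimal sketch (not the PR's exact code): raise HttpClient's request timeout
// from the 100-second default to 150 seconds for a slow remote-server test.
using System;
using System.Net.Http;

static class TimeoutExample
{
    public static HttpClient CreateClientWithExtendedTimeout()
    {
        var client = new HttpClient();
        // HttpClient.Timeout defaults to 100 seconds and must be changed
        // before the first request is sent on this client.
        client.Timeout = TimeSpan.FromSeconds(150);
        return client;
    }
}
```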

@eiriktsarpalis
Member Author

Should hold off merge until CI infra issues are mitigated

@stephentoub
Member

We can try increasing the timeout, but 100 seconds is already a very long time. If they continue to fail sporadically after this increase, there is almost certainly something more fundamental going wrong that needs to be investigated.

@eiriktsarpalis
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@eiriktsarpalis
Member Author

I'm seeing this error in the Debian9 build:

System.Net.Http.CurlException : Stream error in the HTTP/2 framing layer

Not related to SocketsHttpHandler, but I thought I'd point it out. Cf. https://dev.azure.com/dnceng/public/_build/results?buildId=270647&view=ms.vss-test-web.build-test-results-tab&runId=7436544&resultId=100020&paneView=debug

@eiriktsarpalis
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@geoffkizer

> We can try increasing the timeout, but 100 seconds is already a very long time. If they continue to fail sporadically after this increase, there is almost certainly something more fundamental going wrong that needs to be investigated.

I agree it's worth understanding better. That said, at least in the case of PostAsync_RedirectWith307_LargePayload, this has been a longstanding issue even without HTTP2. See e.g. https://github.com/dotnet/corefx/issues/23768 and the discussion and linked issues there. In fact I saw this test fail once in CI for HTTP/1.1 on my PR here.

For that test at least, I think we should just increase the timeout. We understand why it runs slowly.

For the other test (PostRewindableContentUsingAuth_NoPreAuthenticate_Success), I don't think we understand why this runs slowly. Seems worth understanding this better. Looking at this again, I'm slightly worried there's a product issue here related to receiving an early 401 response, but I have no idea if that's actually happening or not.

@@ -133,6 +133,11 @@ protected HttpClient CreateHttpClientForRemoteServer(Configuration.Http.RemoteSe

var client = new HttpClient(wrappedHandler);
SetDefaultRequestVersion(client, remoteServer.HttpVersion);
if (timeout.HasValue)


Rather than pass this in and set it here, can we just have callers set it directly on the HttpClient? This is generally what we do for other settings on HttpClient.

Member Author


Good point, that should simplify the changes significantly.
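For illustration, a sketch of the suggested caller-side pattern, reusing the helper and identifiers from the diff above (the 150-second value is the one from the PR description):

```csharp
// Hypothetical sketch: instead of threading a timeout parameter through
// CreateHttpClientForRemoteServer, the test that needs a longer timeout sets it
// directly on the client returned by the existing helper.
using (HttpClient client = CreateHttpClientForRemoteServer(remoteServer))
{
    // Must be set before the first request is sent on this client.
    client.Timeout = TimeSpan.FromSeconds(150);

    // ... issue the large POST as the test already does ...
}
```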

@@ -2276,7 +2268,10 @@ public async Task PostAsync_Redirect_LargePayload_Helper(Configuration.Http.Remo
// Compute MD5 of request body data. This will be verified by the server when it receives the request.
content.Headers.ContentMD5 = TestHelper.ComputeMD5Hash(contentBytes);

using (HttpClient client = CreateHttpClientForRemoteServer(remoteServer))
// Issue #39545: test occasionally timing out in CI using the default timeout of 100 seconds.

geoffkizer commented Jul 18, 2019


Unless we're leaving the issue open, we shouldn't tag the issue like this. Otherwise it looks like it's an ActiveIssue. It's fine to link to the issue for more info, but we should comment something like:

"Because [reasons], test can take very long to run, particularly in CI. See github issue [link] for more information."

That said, there isn't really any additional info in the GitHub issue other than "this test takes a long time", so I'm not sure it's actually worth linking. The comment itself is pretty self-explanatory.

Same comment below in the other test.

(edited for clarity)

Member Author


Thanks, will adjust wording.

@eiriktsarpalis
Member Author

> For the other test (PostRewindableContentUsingAuth_NoPreAuthenticate_Success), I don't think we understand why this runs slowly. Seems worth understanding this better. Looking at this again, I'm slightly worried there's a product issue here related to receiving an early 401 response, but I have no idea if that's actually happening or not.

I think we should merge the changes so that the test can run a few times in CI. That should give us more data on what could be happening.

@stephentoub
Member

> We understand why it runs slowly.

@geoffkizer, can you clarify? I'm not seeing the explanation in the issue, though I'm probably just missing it. What about the test causes it to take > 100 seconds?

@eiriktsarpalis
Member Author

eiriktsarpalis commented Jul 18, 2019

Tests are failing in Debian & Red Hat despite the increase in timeouts. The surfaced errors are CurlExceptions.

@geoffkizer

> @geoffkizer, can you clarify? I'm not seeing the explanation in the issue, though I'm probably just missing it. What about the test causes it to take > 100 seconds?

I should clarify: we know why this test is at least somewhat slow. Per the linked issue:

> Based off of these two lines it seems like we're sending just over 22 megabytes of content along with this POST.

This doesn't fully explain why it takes > 100 seconds. That said, it seems like there is a lot of general slowness that comes from (a) CI and (b) remote servers that, combined with the above, could reasonably make this test take > 100 seconds on occasion.

It would be interesting to run the test in a loop locally (but still against the remote server) and see how long it takes, on average and in the worst case. If those times look reasonable, then I think we can conclude that it's reasonable to bump up the timeout to deal with CI slowness. If they don't, then we should probably investigate where the slowness is -- the first suspect being the remote servers.
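A rough sketch of that kind of local measurement loop (hypothetical: the endpoint URI below is a placeholder, and the payload size mirrors the ~22 MB figure quoted above):

```csharp
// Hypothetical loop: repeatedly POST a large payload to a remote server and
// record per-request latency to see the average and worst case.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class PostLatencyLoop
{
    static async Task Main()
    {
        using var client = new HttpClient { Timeout = TimeSpan.FromMinutes(10) };
        byte[] payload = new byte[22 * 1024 * 1024]; // roughly the size the test uploads

        var samples = new List<double>();
        for (int i = 0; i < 100; i++)
        {
            var sw = Stopwatch.StartNew();
            using HttpResponseMessage response = await client.PostAsync(
                "https://example.net/upload",            // placeholder, not the real test endpoint
                new ByteArrayContent(payload));
            sw.Stop();

            samples.Add(sw.Elapsed.TotalSeconds);
            Console.WriteLine($"iteration {i}: {sw.Elapsed.TotalSeconds:F2}s ({(int)response.StatusCode})");
        }

        Console.WriteLine($"avg={samples.Average():F2}s worst={samples.Max():F2}s");
    }
}
```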

@geoffkizer

I wonder what sort of bandwidth limitations the remote servers in Azure have. Given the way CI works -- i.e. launch the same tests in parallel for a bunch of different platforms -- it may well be that CI is inducing some bandwidth limits that lead to more frequent timeouts.

@davidsh Do you know?

@davidsh
Contributor

davidsh commented Jul 18, 2019

> I wonder what sort of bandwidth limitations the remote servers in Azure have. Given the way CI works -- i.e. launch the same tests in parallel for a bunch of different platforms -- it may well be that CI is inducing some bandwidth limits that lead to more frequent timeouts.
> @davidsh Do you know?

There are no bandwidth limitations that I'm aware of, and the Azure AppInsights logs don't show any particular errors lately.

@geoffkizer

> There are no bandwidth limitations that I'm aware of, and the Azure AppInsights logs don't show any particular errors lately.

I thought every Azure box came with some sort of max networking bandwidth. For example, there's some description of networking bandwidth for different VM sizes here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-general

Do you know what sort of VMs we are using for these servers?

If we were being limited by Azure VM bandwidth, would this show in logs?

@davidsh
Contributor

davidsh commented Jul 18, 2019

They are not exactly VMs; they are Azure App Service instances. We don't specify a VM size specifically.

We specify "plan size":
[screenshot: App Service plan size setting]

> If we were being limited by Azure VM bandwidth, would this show in logs?

I don't know exactly. Would have to investigate this.

@davidsh
Contributor

davidsh commented Jul 19, 2019

There are some interesting latency spikes in the POST upload URIs. Not sure if this is related to any CI failures or timeouts:

[screenshot: "corefx-net-http11 - Performance" chart showing latency spikes on the POST upload URIs]

@geoffkizer

What do you see for 99th percentile?

If we're occasionally hitting bandwidth limits due to CI slamming the server all at once, I'd expect to see some bad numbers in the 99th percentile.

@davidsh
Contributor

davidsh commented Jul 19, 2019

> What do you see for 99th percentile?
> If we're occasionally hitting bandwidth limits due to CI slamming the server all at once, I'd expect to see some bad numbers in the 99th percentile.

[screenshot: 99th percentile latency figures]

@stephentoub
Member

stephentoub commented Jul 19, 2019

I put it in a loop. In < 100 iterations for each of a PlatformHandler, SocketsHttpHandler 1.1, and SocketsHttpHandler 2.0 test, I got hangs, e.g. taking greater than 10 minutes. They're all stuck trying to write to the network, which suggests (beyond that this timeout increase isn't going to help) that the connection is full because the server has simply stopped reading (or is running slow enough to be equivalent). @davidsh, where's the code for the endpoint these talk to? Or maybe there is some kind of per minute ingestion limit after which it simply stops receiving? Or something like that?

@stephentoub
Member

Actually, ignore my previous comment. I think I misinterpreted some data from xunit. Will follow up...

@stephentoub
Member

Been running it in a loop for around 20 minutes. It's oddly bimodal, with every request taking either ~1.3 seconds or ~8.5 seconds (it starts out mostly the former with a few of the latter thrown in, and then over time switches to being mostly the latter with a few of the former thrown in). I've not seen anything take more than 9 seconds.

@davidsh
Contributor

davidsh commented Jul 19, 2019

> @davidsh, where's the code for the endpoint these talk to?

These endpoints:
http://corefx-net-http11.azurewebsites.net
http://corefx-net-http2.azurewebsites.net

are deployed from this source code:
https://github.com/davidsh/corefx-net-endpoint

@davidsh
Contributor

davidsh commented Jul 19, 2019

> They're all stuck trying to write to the network, which suggests that the connection is full because the server has simply stopped reading (or is running slow enough to be equivalent).

Keep in mind that there is a lot of network infrastructure in front of our endpoints. Azure App Service runs a load-balanced front-end proxy, etc. So any "slowdown" or "congestion" could come from a variety of places in the path from client to server.

@geoffkizer

> Been running it in a loop for around 20 minutes.

It would be interesting to run a bunch of instances in parallel. That's what CI is effectively doing.

@stephentoub
Member

I did yesterday, and it made little difference (the average was still the same, the worst case increased by a couple of seconds, and the highest I saw with 10 loops all banging on this was ~13 seconds). It's of course possible that so much load could be thrown at it that the worst case grew to 100 seconds, but I'm not sure why this test would be distinct in that regard. It's also possible I'm closer to the relevant data center, or my machine is different, or some such thing.

@eiriktsarpalis
Member Author

eiriktsarpalis commented Jul 19, 2019

> It would be interesting to run a bunch of instances in parallel. That's what CI is effectively doing.

I just finished running System.Net.Http.Functional.Tests.PlatformHandler_PostScenarioTest.PostRewindableContentUsingAuth_NoPreAuthenticate_Success 10000 times, throttled to 50 concurrent requests, on Windows. Here's the percentile data:

count=10000 p50=1.45362685 p99=3.649400295 p999=100.0050321996 p9999=100.75617315592999 max=100.7857295

It basically seems that most requests have latencies within the p99 figure, and the rest just time out at 100 seconds.
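For reference, a harness for that kind of run can be sketched roughly as follows (hypothetical: the test body is abstracted into a delegate, and the percentile math is the simple sort-and-index kind, so exact figures may differ):

```csharp
// Hypothetical harness: run an async test body many times with bounded
// concurrency and print simple latency percentiles.
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledRuns
{
    // Usage (hypothetical): await ThrottledRuns.MeasureAsync(RunTestOnceAsync);
    public static async Task MeasureAsync(Func<Task> runOnce, int iterations = 10_000, int maxConcurrency = 50)
    {
        using var throttle = new SemaphoreSlim(maxConcurrency);
        var latencies = new ConcurrentBag<double>();

        var tasks = Enumerable.Range(0, iterations).Select(async _ =>
        {
            await throttle.WaitAsync();
            try
            {
                var sw = Stopwatch.StartNew();
                await runOnce(); // one full test run, client context set up from scratch
                latencies.Add(sw.Elapsed.TotalSeconds);
            }
            finally
            {
                throttle.Release();
            }
        });

        await Task.WhenAll(tasks);

        double[] sorted = latencies.OrderBy(x => x).ToArray();
        double Percentile(double p) => sorted[(int)(p * (sorted.Length - 1))];
        Console.WriteLine(
            $"count={sorted.Length} p50={Percentile(0.50)} p99={Percentile(0.99)} " +
            $"p999={Percentile(0.999)} max={sorted[^1]}");
    }
}
```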

@stephentoub
Member

> and the rest just time out at 100 seconds.

What happens if you set HttpClient.Timeout to be infinite?
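(That is a one-line change on the test's client, shown here as an illustrative sketch:)

```csharp
// Disable the client-side timeout entirely, so a slow request either completes
// or hangs visibly instead of being cancelled at the 100-second default.
client.Timeout = System.Threading.Timeout.InfiniteTimeSpan;
```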

@eiriktsarpalis
Member Author

Let me give it a try

@eiriktsarpalis
Member Author

No responses came back after 20 minutes. I will see if I can reproduce this with different handlers and operating systems.

@eiriktsarpalis
Member Author

eiriktsarpalis commented Jul 19, 2019

Here's what I got so far for PostRewindableContentUsingAuth_NoPreAuthenticate_Success:

| OS      | Handler            | Protocol | p50    | p99     | p999    | max      |
| ------- | ------------------ | -------- | ------ | ------- | ------- | -------- |
| Windows | PlatformHandler    | http1.1  | 1.007  | 2.0007  | 2.3401  | 4.0070   |
| Windows | PlatformHandler    | http2    | 1.453  | 3.6494  |         |          |
| Windows | SocketsHttpHandler | http1.1  | 2.1253 | 20.5400 | 77.8116 | 190.9814 |
| Windows | SocketsHttpHandler | http2    | 1.0155 | 1.441   | 2.0272  | 2.2821   |

(latencies in seconds)

EDIT: for clarification, this is running 10K requests throttled to 50 at a time. Every request is effectively a single test run and sets up the client context from scratch.

@karelz added this to the 5.0 milestone on Aug 3, 2019

Successfully merging this pull request may close these issues.

HTTP2: Two remote server tests occasionally timing out in CI