corefx (public archive)
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Http2: Fix remote server tests occasionally timing out in CI #39588

Closed
wants to merge 2 commits

Conversation

eiriktsarpalis
Member

Increases the HttpClient timeout from the default 100 seconds to 150 seconds, in order to address functional tests occasionally timing out in CI. Fixes #39545.
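For reference, here is a minimal sketch of what the change amounts to on an affected test client, using only the public HttpClient API (the PR's actual helper plumbing is shown in the review diff below; the class and method names here are illustrative):

```csharp
// Minimal sketch (not the PR's exact code): raise HttpClient's request timeout
// from the 100-second default to 150 seconds for a slow remote-server test.
using System;
using System.Net.Http;

static class TimeoutExample
{
    public static HttpClient CreateClientWithExtendedTimeout()
    {
        var client = new HttpClient();
        // HttpClient.Timeout defaults to 100 seconds and must be changed
        // before the first request is sent on this client.
        client.Timeout = TimeSpan.FromSeconds(150);
        return client;
    }
}
```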

@eiriktsarpalis
Member Author

Should hold off merge until CI infra issues are mitigated

@stephentoub
Member

We can try increasing the timeout, but 100 seconds is already a very long time. If they continue to fail sporadically after this increase, there is almost certainly something more fundamental going wrong that needs to be investigated.

@eiriktsarpalis
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@eiriktsarpalis
Member Author

I'm seeing this error in the Debian9 build:

System.Net.Http.CurlException : Stream error in the HTTP/2 framing layer

Not related to SocketsHttpHandler, but I thought I'd point it out. Cf. https://dev.azure.com/dnceng/public/_build/results?buildId=270647&view=ms.vss-test-web.build-test-results-tab&runId=7436544&resultId=100020&paneView=debug

@eiriktsarpalis
Member Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@geoffkizer

> We can try increasing the timeout, but 100 seconds is already a very long time. If they continue to fail sporadically after this increase, there is almost certainly something more fundamental going wrong that needs to be investigated.

I agree it's worth understanding better. That said, at least in the case of PostAsync_RedirectWith307_LargePayload, this has been a longstanding issue even without HTTP2. See e.g. https://github.com/dotnet/corefx/issues/23768 and the discussion and linked issues there. In fact I saw this test fail once in CI for HTTP/1.1 on my PR here.

For that test at least, I think we should just increase the timeout. We understand why it runs slowly.

For the other test (PostRewindableContentUsingAuth_NoPreAuthenticate_Success), I don't think we understand why this runs slowly. Seems worth understanding this better. Looking at this again, I'm slightly worried there's a product issue here related to receiving an early 401 response, but I have no idea if that's actually happening or not.

@@ -133,6 +133,11 @@ protected HttpClient CreateHttpClientForRemoteServer(Configuration.Http.RemoteSe

var client = new HttpClient(wrappedHandler);
SetDefaultRequestVersion(client, remoteServer.HttpVersion);
if (timeout.HasValue)


Rather than pass this in and set it here, can we just have callers set it directly on the HttpClient? This is generally what we do for other settings on HttpClient.

Member Author


Good point, that should simplify the changes significantly.
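For illustration, a sketch of the suggested caller-side pattern, reusing the helper and identifiers from the diff above (the 150-second value is the one from the PR description):

```csharp
// Hypothetical sketch: instead of threading a timeout parameter through
// CreateHttpClientForRemoteServer, the test that needs a longer timeout sets it
// directly on the client returned by the existing helper.
using (HttpClient client = CreateHttpClientForRemoteServer(remoteServer))
{
    // Must be set before the first request is sent on this client.
    client.Timeout = TimeSpan.FromSeconds(150);

    // ... issue the large POST as the test already does ...
}
```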

@@ -2276,7 +2268,10 @@ public async Task PostAsync_Redirect_LargePayload_Helper(Configuration.Http.Remo
// Compute MD5 of request body data. This will be verified by the server when it receives the request.
content.Headers.ContentMD5 = TestHelper.ComputeMD5Hash(contentBytes);

using (HttpClient client = CreateHttpClientForRemoteServer(remoteServer))
// Issue #39545: test occasionally timing out in CI using the default timeout of 100 seconds.

geoffkizer commented Jul 18, 2019


Unless we're leaving the issue open, we shouldn't tag the issue like this. Otherwise it looks like it's an ActiveIssue. It's fine to link to the issue for more info, but we should comment something like:

"Because [reasons], test can take very long to run, particularly in CI. See github issue [link] for more information."

That said, there isn't really any additional info in the GitHub issue other than "this test takes a long time", so I'm not sure it's actually worth linking. The comment itself is pretty self-explanatory.

Same comment below in the other test.

(edited for clarity)

Member Author


Thanks, will adjust wording.

@eiriktsarpalis
Member Author

> For the other test (PostRewindableContentUsingAuth_NoPreAuthenticate_Success), I don't think we understand why this runs slowly. Seems worth understanding this better. Looking at this again, I'm slightly worried there's a product issue here related to receiving an early 401 response, but I have no idea if that's actually happening or not.

I think we should merge the changes so that the test can run a few times in CI. That should give us more data on what could be happening.

@stephentoub
Member

> We understand why it runs slowly.

@geoffkizer, can you clarify? I'm not seeing the explanation in the issue, though I'm probably just missing it. What about the test causes it to take > 100 seconds?

@eiriktsarpalis
Member Author

eiriktsarpalis commented Jul 18, 2019

Tests are failing in Debian & Red Hat despite the increase in timeouts. The surfaced errors are CurlExceptions.

@geoffkizer

> @geoffkizer, can you clarify? I'm not seeing the explanation in the issue, though I'm probably just missing it. What about the test causes it to take > 100 seconds?

I should clarify: we know why this test is at least somewhat slow. Per the linked issue:

> Based off of these two lines it seems like we're sending just over 22 megabytes of content along with this POST.

This doesn't fully explain why it takes > 100 seconds. That said, it seems like there is a lot of general slowness that comes from (a) CI and (b) remote servers that, combined with the above, could reasonably make this test take > 100 seconds on occasion.

It would be interesting to run the test in a loop locally (but still against the remote server) and see how long it takes, on average and in the worst case. If those times look reasonable, then I think we can conclude that it's reasonable to bump up the timeout to deal with CI slowness. If they don't, then we should probably investigate where the slowness is -- the first suspect being the remote servers.
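A rough sketch of that kind of local measurement loop (hypothetical: the endpoint URI below is a placeholder, and the payload size mirrors the ~22 MB figure quoted above):

```csharp
// Hypothetical loop: repeatedly POST a large payload to a remote server and
// record per-request latency to see the average and worst case.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class PostLatencyLoop
{
    static async Task Main()
    {
        using var client = new HttpClient { Timeout = TimeSpan.FromMinutes(10) };
        byte[] payload = new byte[22 * 1024 * 1024]; // roughly the size the test uploads

        var samples = new List<double>();
        for (int i = 0; i < 100; i++)
        {
            var sw = Stopwatch.StartNew();
            using HttpResponseMessage response = await client.PostAsync(
                "https://example.net/upload",            // placeholder, not the real test endpoint
                new ByteArrayContent(payload));
            sw.Stop();

            samples.Add(sw.Elapsed.TotalSeconds);
            Console.WriteLine($"iteration {i}: {sw.Elapsed.TotalSeconds:F2}s ({(int)response.StatusCode})");
        }

        Console.WriteLine($"avg={samples.Average():F2}s worst={samples.Max():F2}s");
    }
}
```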

@geoffkizer

I wonder what sort of bandwidth limitations the remote servers in Azure have. Given the way CI works -- i.e. launch the same tests in parallel for a bunch of different platforms -- it may well be that CI is inducing some bandwidth limits that lead to more frequent timeouts.

@davidsh Do you know?

@davidsh
Contributor

davidsh commented Jul 18, 2019

> I wonder what sort of bandwidth limitations the remote servers in Azure have. Given the way CI works -- i.e. launch the same tests in parallel for a bunch of different platforms -- it may well be that CI is inducing some bandwidth limits that lead to more frequent timeouts.
> @davidsh Do you know?

There are no bandwidth limitations that I'm aware of, and the Azure AppInsights logs don't show any particular errors lately.

@geoffkizer

> There are no bandwidth limitations that I'm aware of, and the Azure AppInsights logs don't show any particular errors lately.

I thought every Azure box came with some sort of max networking bandwidth. For example, there's some description of networking bandwidth for different VM sizes here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-general

Do you know what sort of VMs we are using for these servers?

If we were being limited by Azure VM bandwidth, would this show in logs?

@davidsh
Contributor

davidsh commented Jul 18, 2019

They are not exactly VMs; they are Azure App Service instances. We don't specify a VM size specifically.

We specify "plan size":
[screenshot: App Service plan size setting]

> If we were being limited by Azure VM bandwidth, would this show in logs?

I don't know exactly. Would have to investigate this.

@davidsh
Contributor

davidsh commented Jul 19, 2019

There are some interesting latency spikes in the POST upload URIs. Not sure if this is related to any CI failures or timeouts:

[screenshot: "corefx-net-http11 - Performance" chart showing latency spikes on the POST upload URIs]

@geoffkizer

What do you see for 99th percentile?

If we're occasionally hitting bandwidth limits due to CI slamming the server all at once, I'd expect to see some bad numbers in the 99th percentile.

@davidsh
Contributor

davidsh commented Jul 19, 2019

> What do you see for 99th percentile?
> If we're occasionally hitting bandwidth limits due to CI slamming the server all at once, I'd expect to see some bad numbers in the 99th percentile.

[screenshot: 99th percentile latency figures]

@stephentoub
Member

stephentoub commented Jul 19, 2019

I put it in a loop. In < 100 iterations for each of a PlatformHandler, SocketsHttpHandler 1.1, and SocketsHttpHandler 2.0 test, I got hangs, e.g. taking greater than 10 minutes. They're all stuck trying to write to the network, which suggests (beyond that this timeout increase isn't going to help) that the connection is full because the server has simply stopped reading (or is running slow enough to be equivalent). @davidsh, where's the code for the endpoint these talk to? Or maybe there is some kind of per minute ingestion limit after which it simply stops receiving? Or something like that?

@stephentoub
Member

Actually, ignore my previous comment. I think I misinterpreted some data from xunit. Will follow up...

@stephentoub
Member

Been running it in a loop for around 20 minutes. It's oddly bimodal, with every request taking either ~1.3 seconds or ~8.5 seconds (it starts out mostly the former with a few of the latter thrown in, and then over time switches to being mostly the latter with a few of the former thrown in). I've not seen anything take more than 9 seconds.

@davidsh
Contributor

davidsh commented Jul 19, 2019

> @davidsh, where's the code for the endpoint these talk to?

These endpoints:
http://corefx-net-http11.azurewebsites.net
http://corefx-net-http2.azurewebsites.net

are deployed from this source code:
https://github.com/davidsh/corefx-net-endpoint

@davidsh
Contributor

davidsh commented Jul 19, 2019

> They're all stuck trying to write to the network, which suggests that the connection is full because the server has simply stopped reading (or is running slow enough to be equivalent).

Keep in mind that there is a lot of network infrastructure in front of our endpoints. Azure App Service runs a load-balanced front-end proxy, etc. So any "slowdown" or "congestion" could come from a variety of places in the path from client to server.

@geoffkizer

> Been running it in a loop for around 20 minutes.

It would be interesting to run a bunch of instances in parallel. That's what CI is effectively doing.

@stephentoub
Member

I did yesterday, and it made little difference (the average was still the same, the worst case increased by a couple of seconds, and the highest I saw with 10 loops all banging on this was ~13 seconds). It's of course possible that so much load could be thrown at it that the worst case grew to 100 seconds, but I'm not sure why this test would be distinct in that regard. It's also possible I'm closer to the relevant data center, or my machine is different, or some such thing.

@eiriktsarpalis
Member Author

eiriktsarpalis commented Jul 19, 2019

> It would be interesting to run a bunch of instances in parallel. That's what CI is effectively doing.

I just finished running System.Net.Http.Functional.Tests.PlatformHandler_PostScenarioTest.PostRewindableContentUsingAuth_NoPreAuthenticate_Success 10000 times, throttled to 50 concurrent requests, on Windows. Here's the percentile data:

count=10000 p50=1.45362685 p99=3.649400295 p999=100.0050321996 p9999=100.75617315592999 max=100.7857295

It basically seems that most requests have latencies within the p99 figure, and the rest just time out at 100 seconds.
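For reference, a harness for that kind of run can be sketched roughly as follows (hypothetical: the test body is abstracted into a delegate, and the percentile math is the simple sort-and-index kind, so exact figures may differ):

```csharp
// Hypothetical harness: run an async test body many times with bounded
// concurrency and print simple latency percentiles.
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledRuns
{
    // Usage (hypothetical): await ThrottledRuns.MeasureAsync(RunTestOnceAsync);
    public static async Task MeasureAsync(Func<Task> runOnce, int iterations = 10_000, int maxConcurrency = 50)
    {
        using var throttle = new SemaphoreSlim(maxConcurrency);
        var latencies = new ConcurrentBag<double>();

        var tasks = Enumerable.Range(0, iterations).Select(async _ =>
        {
            await throttle.WaitAsync();
            try
            {
                var sw = Stopwatch.StartNew();
                await runOnce(); // one full test run, client context set up from scratch
                latencies.Add(sw.Elapsed.TotalSeconds);
            }
            finally
            {
                throttle.Release();
            }
        });

        await Task.WhenAll(tasks);

        double[] sorted = latencies.OrderBy(x => x).ToArray();
        double Percentile(double p) => sorted[(int)(p * (sorted.Length - 1))];
        Console.WriteLine(
            $"count={sorted.Length} p50={Percentile(0.50)} p99={Percentile(0.99)} " +
            $"p999={Percentile(0.999)} max={sorted[^1]}");
    }
}
```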

@stephentoub
Member

> and the rest just time out at 100 seconds.

What happens if you set HttpClient.Timeout to be infinite?
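(That is a one-line change on the test's client, shown here as an illustrative sketch:)

```csharp
// Disable the client-side timeout entirely, so a slow request either completes
// or hangs visibly instead of being cancelled at the 100-second default.
client.Timeout = System.Threading.Timeout.InfiniteTimeSpan;
```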

@eiriktsarpalis
Member Author

Let me give it a try

@eiriktsarpalis
Member Author

No responses came back after 20 minutes. I will see if I can reproduce this with different handlers and operating systems.

@eiriktsarpalis
Member Author

eiriktsarpalis commented Jul 19, 2019

Here's what I got so far for PostRewindableContentUsingAuth_NoPreAuthenticate_Success:

| OS      | Handler            | Protocol | p50    | p99     | p999    | max      |
| ------- | ------------------ | -------- | ------ | ------- | ------- | -------- |
| Windows | PlatformHandler    | http1.1  | 1.007  | 2.0007  | 2.3401  | 4.0070   |
| Windows | PlatformHandler    | http2    | 1.453  | 3.6494  |         |          |
| Windows | SocketsHttpHandler | http1.1  | 2.1253 | 20.5400 | 77.8116 | 190.9814 |
| Windows | SocketsHttpHandler | http2    | 1.0155 | 1.441   | 2.0272  | 2.2821   |

(latencies in seconds)

EDIT: for clarification, this is running 10K requests throttled to 50 at a time. Every request is effectively a single test run and sets up the client context from scratch.

@karelz added this to the 5.0 milestone on Aug 3, 2019

Successfully merging this pull request may close these issues.

HTTP2: Two remote server tests occasionally timing out in CI