-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test failure System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments #52031
Comments
Tagging subscribers to this area: @dotnet/ncl Issue DetailsRun: runtime 20210428.85 Failed test:
Error message:
|
To me this issue seems unrelated to the reverted "fix ber scanf/printf". For some reason, unknown to me, on ARM architecture, this test takes more than 5 seconds to complete. I have created this test to check that deflate context takeover works as expected, but it is a bit speculative on the compression algorithm. We can either increase the timeout (or even remove it entirely and depend on xunit cancelling the test if something hangs), or delete the test, because if in the future the underlying compression ratio changes for some reason, this test will be outdated and can start to fail. //cc @CarnaViire |
Timeouts like that tend to be unreliable in our CI, if the test is executed on a heavily loaded machine, it may take more than expected... That being said, we should first gather statistics, how common is this test failure, or it was a one-off. I'll search for the stats. Btw, how could the algorithm change that much so the aggregation of compressed parts of the message will be much bigger/smaller than the compressed whole message?.. wouldn't it be a different/new algo then, not DEFLATE as it was described in RFC? |
The deflate only describes the structure/format of the data. The algorithm is implementation detail that might change (for example zlib-intel vs classic zlib), memory constraints, different performance optimizations that might trade compression ratio for speed. |
I see that it is failing quite regularly (53 times in 2 days on main) for 14 and 15 window bits for alpine.313.arm32 I'm ok with removing the test as we cannot depend on it from the algorithm's point of view, but... Interesting that it is not failing, but hanging, and it's only on arm. I'm interested in what will happen if we just remove our timeout, whether it will complete after some more time, or will continue hanging until killed as LongRunningTest 🤔 |
There is only one way to find out 😎 However I have the nagging suspicion that there is some problem with Is there any way I could run a test on ARM, or the only way would be to create e DRAFT PR that I would delete later? |
@CarnaViire I created a PR, we wait now to see the run times for the test. If I am correct, the fix should be successful. If however you feel that we should remove the test, let me know. |
Would it defeat the purpose of the test to make the message smaller, e.g. frame x5, not x10? And as for running a test on ARM, I'd say CI is the best way I know of. |
It would not. 5 is as good a number as 10 in this case 😊 Let's see how the CI build turns out, I think the issue here was the usage of |
@karelz should be able to arrange for you to get access to an ARM machine if needed. |
@CarnaViire can we run this code in one of the failing environments (alpine.313.arm32, alpine.312.arm64, ubuntu.1804.arm32, ubuntu.1804.arm64): using System;
using System.Diagnostics;
var random = new Random(0);
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 1_000_000; ++i)
{
random.Next(maxValue: 10);
}
stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed); It shouldn't take more than a few hundred milliseconds on the slowest silicon. On my i7 machine it takes |
@BruceForstall this was done in #52086 I agree (this is a reminder to us all 🙂) that when a test fails regularly in CI we should disable it immediately if we can't fix it immediately. We're good engineers and prefer to investigate and fix rather than disable anything, but that can happen while it's disabled. 🙂 |
I assume Viktor CC'ed me for this. We have a limited # of ARM machines that can be accessed via SSH / RDP from appropriate source IPs, you can reach out to @ilyas1974 or myself if he's unavailable to borrow a machine. |
@zlatanov I was finally able to get access to and setup an ARM machine (failed to setup alpine in docker, so using ubuntu.1804.arm64) to run the code with |
For the whole test too, I see at max |
That's wacky -- we know from Kusto that it was always failing (for 14/15). Not even flaky. That includes Ubuntu.1804.arm64. It's our own zlib, right? It's not an issue with which zlib is on the machine? |
Looks like yet another one of the cases where chasing down test failures in CI is difficult without having access to the exact CI machine setup :( ... is there a way to get temporary access to one of the CI machines to confirm we can reproduce it there manually at least? |
As I understand it, I actually did use one of the CI machines that was excluded from the rotation for me: https://github.com/dotnet/core-eng/issues/13006. But because it's excluded, it's not getting the same load it did while it was in CI, so maybe that's the reason... I also ended up creating my own docker image for Ubuntu 18.04, but I doubt it could influence it that much? Or could it? |
Thanks @CarnaViire! I still don't think this is related to zlib. @danmoseley we don't use machine wide installation of zlib. In case of ARM, we use the classic zlib instead of zlib-intel. I will look at the build/tests outputs to see if I can find something interesting. As @CarnaViire suggested it might be related to how busy is the machine, but from |
After looking again in the build logs, I don't think it's load related either. The run times are too consistent - almost If it was load related, I expect we would have deviations, but we don't. |
I spoke imprecisely, I meant when it failed, it was always 14 or 15, try this query: Execute: Web | Desktop | Web (Lens) | Desktop (SAW) https://engsrvprod.kusto.windows.net/engineeringdata TestResults
| join kind=inner WorkItems on WorkItemId
| join kind=inner Jobs on JobId
| where Finished >= datetime(2021-3-1 0:00:00)
and Type == "System.Net.WebSockets.Tests.WebSocketDeflateTests"
and Branch == "refs/heads/main"
| summarize count() by Method, Result, QueueAlias, Arguments, Message |
Which logs are these? I didn't think we retained data for individual tests unless they failed, and I thought you couldn't repro locally. Something else I see from querying the failures: it never fails on 13, and it fails on 15 about twice as much as 14. When it fails for 14, it always fails for 15. (This is consistent with the "machine mysteriously slow" theory) One other idea -- roll back #52086 and add a Stopwatch around each part of the test, including the Random loop, to measure how long each part takes. Put a try/catch for TaskCanceledException around the whole test, and if it's hit, then add those sub-times into the message. That will tell us where the time was spent. Ideally we'd do this locally on @CarnaViire 's ARM machine and just run the test over and over that way ? |
This probably can be done with a simple console app, no xunit involved. |
The simple console app doesn't help - it consistently shows the small numbers I've posted before, like 0.03 for random and 0.3 for the rest... now I'm trying to build and run the actual full test suit, possibly xunit parallel test run influences it somehow. |
So, even when running all WebSockets tests with parallel execution, I am getting 0.34-0.35s for 15 window bits for this test. I guess we have to try executing it in actual CI pipeline... which I am not sure how to do because CI seems to skip test run on ARM e.g. here #52417 (it says it's because isFullMatrix==false apparently) and I don't know how to trigger it 😢 |
The build logs seem to be retained for 60 days. If you go to the tests and filter them you can find the websocket tests for each of the platform. The run times are too consistent. You can try and find older builds (PRs that mentioned this issue) and you will see similar run times. And because I originally used 5 seconds as cancellation timeout, having the test throw |
You can run it manually from your branch. Just click on "Run Pipeline" here: https://dev.azure.com/dnceng/public/_build?definitionId=686 Then, on the source branch choose your branch... you can push a branch to the dotnet fork of dotnet/runtime and then it will show up as available to run the pipeline from; or if you have a PR you can use |
I've found the difference between what I was running on my machine and what was failing in CI. I was running on Release runtime, but the test actually ran overtime on Checked runtime. |
I was racking my brain to think what we were missing, and you figured it out. 👏 |
I've run the measurements multiple times on Checked runtime and it was indeed Random taking all the time there: random part takes |
@jkotas does it surprise you that Random is orders of magnitude slower on a checked runtime on ARM? |
I do not see an obvious reason why it should be that much slower compared to release build. |
Reducing the number of times Random.Next is called to improve runtime performance of test on ARM. Fixes #52031
Run: runtime 20210428.85
Failed test:
Error message:
The text was updated successfully, but these errors were encountered: