
Help needed debugging random but reproducible travis "hangs" #367

Open
matthid opened this issue Dec 16, 2019 · 15 comments

@matthid

matthid commented Dec 16, 2019

See the related discussion in fsprojects/FAKE#2439.
While FAKE indeed makes it harder to debug, the underlying issue is that Expecto is not running my tests properly.
I'm not sure if this is caused by one of the tests or by the Expecto runner itself (but I strongly assume the latter). This doesn't necessarily mean there is a bug here, as it could be in any of the updated components (like the dotnet sdk 3.1), but I'm quite lost on how to debug this further and I hope to get some insights here. This is what I have tried:

  • Added a timeout to every single test with the code in fsprojects/FAKE@d255439 (a sketch of the general approach follows below)
  • Downgraded Expecto to 8.12 (as I started seeing this after a paket update), but it doesn't solve the problem
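For illustration, a minimal sketch of what such a per-test timeout can look like (this is not the actual change in fsprojects/FAKE@d255439; `testCaseWithTimeout` and the 20-second limit are made up for the example):

```fsharp
open Expecto

// Hypothetical helper: wrap an async test body so it fails with a
// TimeoutException instead of hanging the whole run.
let testCaseWithTimeout (timeoutMs: int) name (body: Async<unit>) =
  testCaseAsync name (async {
    // Async.StartChild raises a TimeoutException when the child
    // has not completed within the given timeout.
    let! completed = Async.StartChild(body, timeoutMs)
    do! completed
  })

// Usage:
let tests =
  testList "with timeouts" [
    testCaseWithTimeout 20000 "does not hang" (async {
      do! Async.Sleep 100
      Expect.isTrue true "sanity"
    })
  ]
```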

This is what I noticed:

Any help on how to debug this further is welcome. Note: it seems to reproduce very easily, so in theory it should be "easy" to debug; the only downside is that it only happens on Travis.

One potential source of this issue is that I upgraded to the 3.x version of the .NET Core runtime and added the roll-forward policy for the test projects, but I'd assume that we want Expecto to be compatible with the latest stuff.

Let me know if you need additional information.

@haf
Owner

haf commented Dec 16, 2019

I'm not sure what this might be, but reading through the actual evaluation loop:

https://github.com/haf/expecto/blob/master/Expecto/Async.fs#L44

the Async.Sleep 10 is less than a default quantum (80 ms) and is effectively a spinlock. So if you only have a single thread, or it is too slow, you may livelock since the Task never gets scheduled. This is my best guess.
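To illustrate the shape of the problem (this is not Expecto's exact code, just a minimal sketch of a poll loop that can livelock once the pool is down to a single usable thread):

```fsharp
open System.Threading.Tasks

// Sketch only: a loop like this re-queues its own continuation every 10 ms.
// If the thread pool effectively has one thread, the task below may never be
// scheduled, so the loop keeps sleeping and checking forever (a livelock).
let rec waitForTask (task: Task) = async {
  if not task.IsCompleted then
    do! Async.Sleep 10   // shorter than a typical scheduler quantum
    return! waitForTask task
}
```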

Maybe you can reproduce and dump if you run Docker with the --cpus option? https://docs.docker.com/config/containers/resource_constraints/#cpu This is probably what Travis does.

I will not have time to help you debug this right now, however. I can assist from the sidelines.

@matthid
Author

matthid commented Dec 16, 2019

Yes, I assumed thread starvation because of

if Task.WaitAny tasks = 0 then

but increasing the thread pool didn't actually change anything (the sketch below shows roughly what I tried). Now I'm doing one run with --no-spinner because there seems to be quite some magic in there.
I'm trying to get Docker running, but at the moment I don't have a working installation on my workstation :)
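For reference, roughly what "increasing the thread pool" amounted to (a sketch; the minimum of 16 is arbitrary):

```fsharp
open System.Threading

// Raise the minimum worker/IO-completion thread counts so the pool does not
// have to ramp up slowly while the runner is polling.
let bumpThreadPool () =
  let mutable worker = 0
  let mutable iocp = 0
  ThreadPool.GetMinThreads(&worker, &iocp)
  ThreadPool.SetMinThreads(max worker 16, max iocp 16) |> ignore
```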

> I will not have time to help you debug this right now, however. I can assist from the sidelines.

Thanks. To be honest, I'm slowly running out of ideas and just taking some wild shots in the dark.

Since adding the --debug flag it seems to fail on GitLab as well - very interesting.

@haf
Owner

haf commented Dec 16, 2019

Just because you increase the thread pool doesn't mean you'll get more concurrency: if one thread starves the others and both are using green threads (Async.Sleep etc.), they end up failing at cooperation.

I think there are some possible edge cases here:

  • the line linked above
  • the juggling of stdout needed to do progress and colourised printing concurrently — see e.g. F/avoid mutable #354

Since it's reproducible for you, we're in a pretty good position.

  • You should be able to take a dump/snapshot/stacktrace and see the wait graph when it freezes.
  • You can detach much of the printing by supplying a no-op printer - it's configurable (see the sketch below). Do a couple of hundred runs and see if it still fails.
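A minimal sketch of the no-op printer idea, assuming the config record exposes a `printer : TestPrinters` field and that `TestPrinters.silent` exists in the version in use (check the `TestPrinters` record in your Expecto version and adjust the names accordingly):

```fsharp
open Expecto
open Expecto.Impl

// Assumption: `defaultConfig` has a `printer` field and `TestPrinters.silent`
// is available; if not, supply a hand-written TestPrinters record whose
// members all do nothing.
let silentConfig = { defaultConfig with printer = TestPrinters.silent }

[<EntryPoint>]
let main argv =
  // Runs the [<Tests>]-marked tests in the assembly without the
  // progress/colour printer.
  runTestsInAssembly silentConfig argv
```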

If all of this fails, I have a new major version brewing that uses Hopac instead of Async + Task that I can push if you want me to. Logary shows that Hopac is a saner model for programming concurrently -- I've actually run into this particular bug once or twice and that's when I decided this code-base needed a refactoring. I just haven't got the time to complete it (it's at about 500 LoC right now, mostly API surface area), and it's far from done.

@matthid
Author

matthid commented Dec 17, 2019

I have 'inlined' Expecto.fs and disabled some stuff (to be able to test some fixes within Expecto more quickly) and changed to the silent printer. Now the build doesn't hang anymore... Let's see if we can make it hang again.

@matthid
Author

matthid commented Dec 17, 2019

Adding the printer back makes it hang again, so it is definitely that. Anything else you want me to try?
I guess I can now work around this for FAKE by adding my own simple printer, which doesn't use complicated coloring and Logary.

@haf
Owner

haf commented Dec 17, 2019

It is not Logary that hangs; Logary is well tested. The progress pretty-printing, the locking and the Tasks are not my creation - hence the rewrite in progress.

I would like a dump of a hang if you can get it. Thanks! :) If you can help me fix the issue or contribute money towards a developer who fixes the issue, that would be helpful too.

@matthid
Author

matthid commented Dec 17, 2019

Any idea how I can get one via scripting, within a reasonable amount of time, and then upload the dump somewhere from within Travis?

@haf
Owner

haf commented Dec 17, 2019

You can upload the dump in this thread.

I think you can exec into your Docker container when you've reproduced the problem and run createdump; docs: https://github.com/dotnet/coreclr/blob/master/Documentation/botr/xplat-minidump-generation.md#configurationpolicy
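A scripted alternative, as a sketch only (it assumes the Microsoft.Diagnostics.NETCore.Client NuGet package is referenced; `startDumpWatchdog`, the 10-minute limit and the dump path are made up for the example): start a watchdog inside the test process that writes a dump if the run is still alive after a time limit, then let the CI job archive the file.

```fsharp
open System
open System.Diagnostics
open Microsoft.Diagnostics.NETCore.Client

// If the process is still running after `limit`, capture a full dump of it so
// the CI job can pick the file up as a build artifact.
let startDumpWatchdog (limit: TimeSpan) (dumpPath: string) =
  async {
    do! Async.Sleep (int limit.TotalMilliseconds)
    let client = DiagnosticsClient(Process.GetCurrentProcess().Id)
    client.WriteDump(DumpType.Full, dumpPath)
  }
  |> Async.Start

// Usage, before running the tests:
// startDumpWatchdog (TimeSpan.FromMinutes 10.) "/tmp/expecto-hang.dmp"
```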

@matthid
Author

matthid commented Dec 17, 2019

I guess I'd have to hardcode my GitHub password/token and find some code that uploads files to GitHub issues, which is not something I want to do :D

matthid added a commit to fsprojects/FAKE that referenced this issue Dec 17, 2019
@matthid
Author

matthid commented Dec 17, 2019

One interesting observation: while it seems to be due to the logging code (i.e. it works with a custom printer), on my initial "hangs" I had no actual Expecto output (as I was running Expecto without --debug), so the hang must be due to some no-op/reset code.

@matthid
Author

matthid commented Dec 18, 2019

For some reason I cannot reproduce this in #368 - not sure why.

@haf
Owner

haf commented Apr 4, 2020

I'm going to keep this issue open until we've refactored any logic that can trigger this. I have some plans, and the split of files that was just merged also helps.

@teo-tsirpanis
Contributor

@matthid, have you tried reproducing the bug with the latest version of Expecto and the .NET Core SDK/runtime?

@otto-gebb
Contributor

otto-gebb commented Jul 14, 2020

My colleague encountered this hang on his Mac. A custom printf-based printer by @matthid (from this changeset) helped. Thanks a lot, @matthid!
Expecto version: 9.0.2

@vilinski
Contributor

Currently observing hangs on Mac with 9.0.2...
It works most of the time, but almost no test run survives if I use --stress 1.
