Ubuntu CI: tests randomly fail with SIGSEGV (exit: 139) #4078
Comments
@dotnet/dnceng were there any changes to Ubuntu agents around 13 June that could have led to this? |
@RussKie - there have been no changes to it around June 13th. |
@jakubch1 would you have any suggestions on how to obtain more insights from the testhost? |
Can we / should we grab any of the work recently done in dotnet/runtime to get dumps more reliably? That would be @agocke. But I guess that's not necessarily using the vs test host (?) |
Some help would be grand, I'm literally wandering in the dark... So far, my experiments suggest that not collecting code coverage (i.e., |
It's quite plausible that collecting code coverage causes a crash. E.g., it causes invalid IL, which in general we are not hardened against. We will need a dump or a local repro, I expect. |
Unfortunately I'm not sure any of the work in the runtime repo would help, as we don't use My first suggestion would be to try using the standard createdump via the environment variables. See https://learn.microsoft.com/en-us/troubleshoot/developer/webapps/aspnetcore/practice-troubleshoot-linux/lab-1-3-capture-core-crash-dumps#configure-createdump-to-run-at-process-termination for more info. |
@RussKie dotnet-coverage tool supports instrumentation in two ways on Linux:
I would recommend in this scenario to try to disable dynamic instrumentation and use only static by adding into config:
To make it work you also need to specify which files you want to instrument using option I think when using static instrumentation you can easily use the dotnet-dump tool. I will try to repro this issue on Linux. Were you able to repro locally? |
Environment variables are probably the lowest-hanging fruit here - dotnet dump won't really help. It's meant for capturing ad-hoc dumps, not for dealing with crash scenarios. I'd recommend grabbing full dumps if possible, since 139 hints at native issues and having all pages might prove useful. |
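For readers following along: the environment variables discussed here are presumably the createdump settings described in the linked Microsoft article (`DOTNET_DbgEnableMiniDump`, `DOTNET_DbgMiniDumpType`, `DOTNET_DbgMiniDumpName`). A minimal sketch of building such an environment for a child test process is below. The helper name `crash_dump_env` and the dump directory are hypothetical, and older runtimes may expect the `COMPlus_` prefix instead of `DOTNET_`.

```python
import os


def crash_dump_env(dump_dir: str, full: bool = True) -> dict:
    """Return a copy of the current environment with createdump enabled.

    Hypothetical helper for illustration; the variable names come from the
    .NET createdump documentation, not from this thread.
    """
    env = dict(os.environ)
    # Ask the runtime to invoke createdump when the process crashes.
    env["DOTNET_DbgEnableMiniDump"] = "1"
    # Dump type: 1 = mini, 2 = heap, 3 = triage, 4 = full.
    env["DOTNET_DbgMiniDumpType"] = "4" if full else "1"
    # %p is expanded by the runtime to the PID of the crashing process.
    env["DOTNET_DbgMiniDumpName"] = os.path.join(dump_dir, "coredump.%p")
    return env
```

The resulting dictionary could then be passed to the test step, for example `subprocess.run(["dotnet", "test"], env=crash_dump_env("/tmp/dumps"))`, where both the command and the path are placeholders rather than this pipeline's actual invocation.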
@hoyosjs do you mean setting these?
Our build pipeline is split into steps - restore, build, test. The last one is the one that runs the
No, I haven't tried running this locally (I've been stretched thin among many tasks). I'll see if I can repro it locally this week. |
Yes, those settings look correct. This is in the build machine, right? |
This will not work for 2 reasons: you are missing
|
@jakubch1 if I include
If I don't include that, I get the following (just like you mentioned above):
Is I also tried
Looking through a binlog, the Build target isn't being run again: |
Based on the console output, files are still copied to the output directory even though msbuild gets only the test target. Another issue is that static instrumentation can't be done in parallel and we don't have caching yet. This is currently causing the "hang"; I will try to find a way to speed it up on our side. As Mono.Cecil is slow, I am not sure we will eventually get good performance with static instrumentation. I was not able to reproduce the initial crash on my box. I also noticed that you were instrumenting Test Platform packages. Could you please try moving back to the initial configuration, but configure correct exclude paths so that code coverage is collected only for your DLLs? Maybe this will help with the crash. In your scenario it is much simpler to use dynamic instrumentation. |
I've re-enabled code coverage like it previously was, and it looks like the issue is gone... ¯\_(ツ)_/¯ Thank you all for your help and time. |
The public and internal builds have been quite unstable on Ubuntu legs executing the test step.
Many builds would fail even though all tests would succeed. Further inspection of the binlogs showed that the testhost process exited with code 139, which means segmentation fault, i.e., the program tried to access a memory location not allocated to it.
E.g.,:
The public CI started failing with #4061; however, the same change built successfully. The internal CI started failing two merges later, with #4062 (which made no code changes).
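As a side note on why exit code 139 is read as a segmentation fault: shells conventionally report a process killed by a signal as 128 plus the signal number, and SIGSEGV is signal 11 on Linux. A minimal sketch of that decoding is below; the function name `describe_exit_code` is made up for illustration and assumes a Linux host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a readable description."""
    if code > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name} (signal {signum})"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited normally with status {code}"


print(describe_exit_code(139))  # on Linux: killed by SIGSEGV (signal 11)
```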