Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ubuntu CI: tests randomly fail with SIGSEGV (exit: 139) #4078

Closed
RussKie opened this issue Jun 15, 2023 · 15 comments
Closed

Ubuntu CI: tests randomly fail with SIGSEGV (exit: 139) #4078

RussKie opened this issue Jun 15, 2023 · 15 comments

Comments

@RussKie
Copy link
Member

RussKie commented Jun 15, 2023

The public and internal builds have been quite unstable on Ubuntu legs executing the test step.

Many builds would fail even though all tests would succeed. Further inspections of binlogs would show that testhost process would exit with code 139, which means Segmentation Fault, i.e. the program was trying to access a memory location not allocated to it.
image

E.g.,:

The public CI started failing with #4061, however, the same change successfully built. The internal CI started failing two merges later - with #4062 (which made no code changes).

@RussKie
Copy link
Member Author

RussKie commented Jun 15, 2023

@dotnet/dnceng were there any changes to Ubuntu agents around 13 June that could have led to this?

@epananth
Copy link
Member

epananth commented Jun 15, 2023

@RussKie - there have been no changes to it around June 13th.

@RussKie
Copy link
Member Author

RussKie commented Jun 15, 2023

@jakubch1 would you have any suggestions on how to obtain more insights from the testhost?

@danmoseley
Copy link
Member

can we / should we grab any of the work recently done in dotnet/runtime to get dumps more reliably? that would be @agocke . But, I guess that's not necessarily using the vs test host (?)

@RussKie
Copy link
Member Author

RussKie commented Jun 16, 2023

Some help would be grand, I'm literally wandering in a dark... So far, my experiments suggest that not collecting code coverage (i.e., dotnet code-coverage "build -test") results in no crashes. Obviously, I could only run a small number of manual builds across both infras. I merged #4081 to get more results.

@danmoseley
Copy link
Member

It's quite plausible that collecting code coverage causes a crash. Eg it causes invalid IL which in general we are not hardened against. We will need a dump or local repro I expect.
This doesn't happen on internal code coverage runs? Are we using the same version of coverage tooling, and runtime version?

@agocke
Copy link
Member

agocke commented Jun 16, 2023

Unfortunately I'm not sure any of the work in the runtime repo would help, as we don't use dotnet test. @hoyosjs may have suggestions on how to capture a crash dump in these circumstances.

My first suggestion would be try using the standard createdump via the environment variables. See https://learn.microsoft.com/en-us/troubleshoot/developer/webapps/aspnetcore/practice-troubleshoot-linux/lab-1-3-capture-core-crash-dumps#configure-createdump-to-run-at-process-termination for more info.

@jakubch1
Copy link
Member

@RussKie dotnet-coverage tool supports instrumentation in two ways on Linux:

  • dynamic - it is using CLR profiling and CLR Instrumentation Engine to perform instrumentation. It loads and uses some native libraries
  • static - it is using Mono.Cecil to perform instrumentation before your dlls are loaded by testing process. It doesn't load any additional native libraries

I would recommend in this scenario to try to disable dynamic instrumentation and use only static by adding into config:

<EnableStaticManagedInstrumentation>True</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>

To make it work you need to also provide which files you want to instrument using option --include-files and probably you need to split arcade to first run build and then execute tests under dotnet-coverage.
There is another approach here: you can also skip adding --include-files if you use dotnet test --collect "Code Coverage" with runsettings with above options. When using dotnet test --collect "Code Coverage" coverage tooling automatically statically instruments all files in test project output directory.

I think when using static instrumentation you can easily use dotnet-dump tool. I will try to repro on linux this issue. Were you able to repro locally?

@hoyosjs
Copy link
Member

hoyosjs commented Jun 16, 2023

Environment variables are probably the lowest hanging fruit here - dotnet dump won't really help. It's to capture ad-hoc dumps, not deal with crash scenarios. I'd recommend to grab full dumps if possible since 139 hints at native issues and having all pages might prove useful.

@RussKie
Copy link
Member Author

RussKie commented Jun 19, 2023

I'd recommend to grab full dumps if possible since 139 hints at native issues and having all pages might prove useful.

@hoyosjs do you mean setting these?

export COMPlus_DbgEnableMiniDump=1
export COMPlus_DbgMiniDumpType=4
export COMPlus_DbgMiniDumpName=$(ArtifactLog)/coredump.%d

I would recommend in this scenario to try to disable dynamic instrumentation and use only static by adding into config:

<EnableStaticManagedInstrumentation>True</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>

To make it work you need to also provide which files you want to instrument using option --include-files and probably you need to split arcade to first run build and then execute tests under dotnet-coverage. There is another approach here: you can also skip adding --include-files if you use dotnet test --collect "Code Coverage" with runsettings with above options. When using dotnet test --collect "Code Coverage" coverage tooling automatically statically instruments all files in test project output directory.

Our build pipeline is split into steps - restore, build, test. The last one is the one that runs the code-coverage over build -test.
@jakubch1 I applied your suggestion to #4088, is that what you meant, or have I missed something?

I will try to repro on linux this issue. Were you able to repro locally?

No, I haven't tried running this locally (I've been stretched thin among many tasks). I'll see if I can repro it locally this week.

@RussKie RussKie removed the untriaged label Jun 19, 2023
@hoyosjs
Copy link
Member

hoyosjs commented Jun 19, 2023

Yes, those settings look correct. This is in the build machine, right?

@jakubch1
Copy link
Member

I'd recommend to grab full dumps if possible since 139 hints at native issues and having all pages might prove useful.

@hoyosjs do you mean setting these?

export COMPlus_DbgEnableMiniDump=1
export COMPlus_DbgMiniDumpType=4
export COMPlus_DbgMiniDumpName=$(ArtifactLog)/coredump.%d

I would recommend in this scenario to try to disable dynamic instrumentation and use only static by adding into config:

<EnableStaticManagedInstrumentation>True</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>

To make it work you need to also provide which files you want to instrument using option --include-files and probably you need to split arcade to first run build and then execute tests under dotnet-coverage. There is another approach here: you can also skip adding --include-files if you use dotnet test --collect "Code Coverage" with runsettings with above options. When using dotnet test --collect "Code Coverage" coverage tooling automatically statically instruments all files in test project output directory.

Our build pipeline is split into steps - restore, build, test. The last one is the one that runs the code-coverage over build -test. @jakubch1 I applied your suggestion to #4088, is that what you meant, or have I missed something?

I will try to repro on linux this issue. Were you able to repro locally?

No, I haven't tried running this locally (I've been stretched thin among many tasks). I'll see if I can repro it locally this week.

This will not work for 2 reasons: you are missing --include-files <include-files> to specify which files should be instrumented. I think you can specify whole artifacts folder here. Second issue is that when you do: build.sh -test arcade still performs build so instrumented files will be overwritten. I think you can try: ./eng/common/build.sh -test. First line of output should be Running tests: ..... Currently you are making build:

/mnt/vss/_work/1/s/.dotnet/dotnet dotnet-coverage collect --settings /mnt/vss/_work/1/s/eng/CodeCoverage.config --output /mnt/vss/_work/1/s/artifacts/TestResults/Release/ "/mnt/vss/_work/1/s/build.sh --ci -test -configuration Release /bl:/mnt/vss/_work/1/s/artifacts/log/Release//tests.binlog "
========================== Starting Command Output ===========================
/usr/bin/bash --noprofile --norc /mnt/vss/_work/_temp/d8b68240-322a-4279-8b97-a38d1f426eb1.sh
Microsoft (R) Code Coverage Command Line Tool (x64)
Copyright (c) Microsoft Corporation. All rights reserved.

SessionId: 2ca285c3-f49d-4c7a-b8a6-2f66400e48c8
[INFO] Building the default project as configured in eng/Build.props...
  Microsoft.Extensions.ExtraAnalyzers.Roslyn4.0 -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Extensions.ExtraAnalyzers.Roslyn4.0/Release/netstandard2.0/Microsoft.Extensions.ExtraAnalyzers.Roslyn4.0.dll
  Microsoft.Extensions.ExtraAnalyzers.Roslyn3.8 -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Extensions.ExtraAnalyzers.Roslyn3.8/Release/netstandard2.0/Microsoft.Extensions.ExtraAnalyzers.Roslyn3.8.dll
  Microsoft.Extensions.LocalAnalyzers -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Extensions.LocalAnalyzers/Release/netstandard2.0/Microsoft.Extensions.LocalAnalyzers.dll
  Microsoft.Gen.MeteringReports.Roslyn3.8 -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Gen.MeteringReports.Roslyn3.8/Release/netstandard2.0/Microsoft.Gen.MeteringReports.dll
  Microsoft.Gen.EnumStrings.Roslyn4.0 -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Gen.EnumStrings.Roslyn4.0/Release/netstandard2.0/Microsoft.Gen.EnumStrings.dll

@RussKie
Copy link
Member Author

RussKie commented Jun 20, 2023

@jakubch1 if I include --include-files ./artifacts/bin/ in the command, it just hangs:

$ ./restore.sh
$ ./build --build -c Release
$ ./.dotnet/dotnet dotnet-coverage collect --settings ./eng/CodeCoverage.config --output ./artifacts/TestResults/Release/ubuntu.cobertura.xml "./eng/common/build.sh  -test -configuration Release /bl:./artifacts/log/Release//tests.binlog " --include-files ./artifacts/bin/

If I don't include that, I get the following (just like you mentioned above):

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:06:03.57
No code coverage data available. Profiler was not initialized. Verify that glibc (>=2.27), libxml2 and all .NET dependencies are installed.
Code coverage results: ./artifacts/TestResults/Release/ubuntu.cobertura.xml.

Is --include-files ./artifacts/bin/ correct? The help isn't very clear.
image

I also tried --include-files ./artifacts/bin/**/*.dll but with the same end result. Am I not using it correctly?

build.sh -test arcade still performs build so instrumented files will be overwritten. I think you can try: ./eng/common/build.sh -test. First line of output should be Running tests: ..... Currently you are making build:

Looking through a binlog, the Build target isn't being run again:

image
...
image

@jakubch1
Copy link
Member

Based on console output files are still copied to output even if I see that msbuild get only test target. Another issue is that static instrumentation can't be done in parallel and we don't have caching now. Currently it is causing "hang" - I will try to work on our side to find a way to speed it up. As mono.cecil is slow I am not sure if we get eventually good performance here with static instrumentation.

I was not able to reproduce initial issue with crash on my box. I noticed another thing that you were instrumenting Test Platform packages. Could you please try move back to initial configuration but configure correct exclude paths to collect code coverage only for your dlls? Maybe this will help with crash. In your scenario it is much simpler to use dynamic instrumentation.

@RussKie
Copy link
Member Author

RussKie commented Jun 21, 2023

I've re-enabled code coverage like it previously was, and it looks like the issue is gone... ¯\_(ツ)_/¯
Closing for now and will re-open, if it comes back.

Thank you all for your help and time.

@RussKie RussKie closed this as completed Jun 21, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 21, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants