Ubuntu CI: tests randomly fail with SIGSEGV (exit: 139) #4078

RussKie · 2023-06-15T06:20:45Z

The public and internal builds have been quite unstable on Ubuntu legs executing the test step.

Many builds would fail even though all tests would succeed. Further inspections of binlogs would show that testhost process would exit with code 139, which means Segmentation Fault, i.e. the program was trying to access a memory location not allocated to it.

E.g.,:

The public CI started failing with #4061, however, the same change successfully built. The internal CI started failing two merges later - with #4062 (which made no code changes).

RussKie · 2023-06-15T06:22:00Z

@dotnet/dnceng were there any changes to Ubuntu agents around 13 June that could have led to this?

epananth · 2023-06-15T17:56:30Z

@RussKie - there have been no changes to it around June 13th.

RussKie · 2023-06-15T22:28:04Z

@jakubch1 would you have any suggestions on how to obtain more insights from the testhost?

danmoseley · 2023-06-16T00:45:55Z

can we / should we grab any of the work recently done in dotnet/runtime to get dumps more reliably? that would be @agocke . But, I guess that's not necessarily using the vs test host (?)

RussKie · 2023-06-16T02:13:50Z

Some help would be grand, I'm literally wandering in a dark... So far, my experiments suggest that not collecting code coverage (i.e., dotnet code-coverage "build -test") results in no crashes. Obviously, I could only run a small number of manual builds across both infras. I merged #4081 to get more results.

danmoseley · 2023-06-16T03:16:42Z

It's quite plausible that collecting code coverage causes a crash. Eg it causes invalid IL which in general we are not hardened against. We will need a dump or local repro I expect.
This doesn't happen on internal code coverage runs? Are we using the same version of coverage tooling, and runtime version?

agocke · 2023-06-16T04:19:03Z

Unfortunately I'm not sure any of the work in the runtime repo would help, as we don't use dotnet test. @hoyosjs may have suggestions on how to capture a crash dump in these circumstances.

My first suggestion would be try using the standard createdump via the environment variables. See https://learn.microsoft.com/en-us/troubleshoot/developer/webapps/aspnetcore/practice-troubleshoot-linux/lab-1-3-capture-core-crash-dumps#configure-createdump-to-run-at-process-termination for more info.

jakubch1 · 2023-06-16T06:26:33Z

@RussKie dotnet-coverage tool supports instrumentation in two ways on Linux:

dynamic - it is using CLR profiling and CLR Instrumentation Engine to perform instrumentation. It loads and uses some native libraries
static - it is using Mono.Cecil to perform instrumentation before your dlls are loaded by testing process. It doesn't load any additional native libraries

I would recommend in this scenario to try to disable dynamic instrumentation and use only static by adding into config:

<EnableStaticManagedInstrumentation>True</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>

To make it work you need to also provide which files you want to instrument using option --include-files and probably you need to split arcade to first run build and then execute tests under dotnet-coverage.
There is another approach here: you can also skip adding --include-files if you use dotnet test --collect "Code Coverage" with runsettings with above options. When using dotnet test --collect "Code Coverage" coverage tooling automatically statically instruments all files in test project output directory.

I think when using static instrumentation you can easily use dotnet-dump tool. I will try to repro on linux this issue. Were you able to repro locally?

hoyosjs · 2023-06-16T07:15:05Z

Environment variables are probably the lowest hanging fruit here - dotnet dump won't really help. It's to capture ad-hoc dumps, not deal with crash scenarios. I'd recommend to grab full dumps if possible since 139 hints at native issues and having all pages might prove useful.

RussKie · 2023-06-19T10:02:26Z

I'd recommend to grab full dumps if possible since 139 hints at native issues and having all pages might prove useful.

@hoyosjs do you mean setting these?

export COMPlus_DbgEnableMiniDump=1
export COMPlus_DbgMiniDumpType=4
export COMPlus_DbgMiniDumpName=$(ArtifactLog)/coredump.%d

I would recommend in this scenario to try to disable dynamic instrumentation and use only static by adding into config:
<EnableStaticManagedInstrumentation>True</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>
To make it work you need to also provide which files you want to instrument using option --include-files and probably you need to split arcade to first run build and then execute tests under dotnet-coverage. There is another approach here: you can also skip adding --include-files if you use dotnet test --collect "Code Coverage" with runsettings with above options. When using dotnet test --collect "Code Coverage" coverage tooling automatically statically instruments all files in test project output directory.

Our build pipeline is split into steps - restore, build, test. The last one is the one that runs the code-coverage over build -test.
@jakubch1 I applied your suggestion to #4088, is that what you meant, or have I missed something?

I will try to repro on linux this issue. Were you able to repro locally?

No, I haven't tried running this locally (I've been stretched thin among many tasks). I'll see if I can repro it locally this week.

hoyosjs · 2023-06-19T10:14:04Z

Yes, those settings look correct. This is in the build machine, right?

jakubch1 · 2023-06-19T10:25:01Z

I'd recommend to grab full dumps if possible since 139 hints at native issues and having all pages might prove useful.

@hoyosjs do you mean setting these?
export COMPlus_DbgEnableMiniDump=1
export COMPlus_DbgMiniDumpType=4
export COMPlus_DbgMiniDumpName=$(ArtifactLog)/coredump.%d
I would recommend in this scenario to try to disable dynamic instrumentation and use only static by adding into config:
<EnableStaticManagedInstrumentation>True</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>
To make it work you need to also provide which files you want to instrument using option --include-files and probably you need to split arcade to first run build and then execute tests under dotnet-coverage. There is another approach here: you can also skip adding --include-files if you use dotnet test --collect "Code Coverage" with runsettings with above options. When using dotnet test --collect "Code Coverage" coverage tooling automatically statically instruments all files in test project output directory.
Our build pipeline is split into steps - restore, build, test. The last one is the one that runs the code-coverage over build -test. @jakubch1 I applied your suggestion to #4088, is that what you meant, or have I missed something?

I will try to repro on linux this issue. Were you able to repro locally?

No, I haven't tried running this locally (I've been stretched thin among many tasks). I'll see if I can repro it locally this week.

This will not work for 2 reasons: you are missing --include-files <include-files> to specify which files should be instrumented. I think you can specify whole artifacts folder here. Second issue is that when you do: build.sh -test arcade still performs build so instrumented files will be overwritten. I think you can try: ./eng/common/build.sh -test. First line of output should be Running tests: ..... Currently you are making build:

/mnt/vss/_work/1/s/.dotnet/dotnet dotnet-coverage collect --settings /mnt/vss/_work/1/s/eng/CodeCoverage.config --output /mnt/vss/_work/1/s/artifacts/TestResults/Release/ "/mnt/vss/_work/1/s/build.sh --ci -test -configuration Release /bl:/mnt/vss/_work/1/s/artifacts/log/Release//tests.binlog "
========================== Starting Command Output ===========================
/usr/bin/bash --noprofile --norc /mnt/vss/_work/_temp/d8b68240-322a-4279-8b97-a38d1f426eb1.sh
Microsoft (R) Code Coverage Command Line Tool (x64)
Copyright (c) Microsoft Corporation. All rights reserved.

SessionId: 2ca285c3-f49d-4c7a-b8a6-2f66400e48c8
[INFO] Building the default project as configured in eng/Build.props...
  Microsoft.Extensions.ExtraAnalyzers.Roslyn4.0 -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Extensions.ExtraAnalyzers.Roslyn4.0/Release/netstandard2.0/Microsoft.Extensions.ExtraAnalyzers.Roslyn4.0.dll
  Microsoft.Extensions.ExtraAnalyzers.Roslyn3.8 -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Extensions.ExtraAnalyzers.Roslyn3.8/Release/netstandard2.0/Microsoft.Extensions.ExtraAnalyzers.Roslyn3.8.dll
  Microsoft.Extensions.LocalAnalyzers -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Extensions.LocalAnalyzers/Release/netstandard2.0/Microsoft.Extensions.LocalAnalyzers.dll
  Microsoft.Gen.MeteringReports.Roslyn3.8 -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Gen.MeteringReports.Roslyn3.8/Release/netstandard2.0/Microsoft.Gen.MeteringReports.dll
  Microsoft.Gen.EnumStrings.Roslyn4.0 -> /mnt/vss/_work/1/s/artifacts/bin/Microsoft.Gen.EnumStrings.Roslyn4.0/Release/netstandard2.0/Microsoft.Gen.EnumStrings.dll

RussKie · 2023-06-20T07:59:25Z

@jakubch1 if I include --include-files ./artifacts/bin/ in the command, it just hangs:

$ ./restore.sh
$ ./build --build -c Release
$ ./.dotnet/dotnet dotnet-coverage collect --settings ./eng/CodeCoverage.config --output ./artifacts/TestResults/Release/ubuntu.cobertura.xml "./eng/common/build.sh  -test -configuration Release /bl:./artifacts/log/Release//tests.binlog " --include-files ./artifacts/bin/

If I don't include that, I get the following (just like you mentioned above):

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:06:03.57
No code coverage data available. Profiler was not initialized. Verify that glibc (>=2.27), libxml2 and all .NET dependencies are installed.
Code coverage results: ./artifacts/TestResults/Release/ubuntu.cobertura.xml.

Is --include-files ./artifacts/bin/ correct? The help isn't very clear.

I also tried --include-files ./artifacts/bin/**/*.dll but with the same end result. Am I not using it correctly?

build.sh -test arcade still performs build so instrumented files will be overwritten. I think you can try: ./eng/common/build.sh -test. First line of output should be Running tests: ..... Currently you are making build:

Looking through a binlog, the Build target isn't being run again:

...

jakubch1 · 2023-06-20T09:12:09Z

Based on console output files are still copied to output even if I see that msbuild get only test target. Another issue is that static instrumentation can't be done in parallel and we don't have caching now. Currently it is causing "hang" - I will try to work on our side to find a way to speed it up. As mono.cecil is slow I am not sure if we get eventually good performance here with static instrumentation.

I was not able to reproduce initial issue with crash on my box. I noticed another thing that you were instrumenting Test Platform packages. Could you please try move back to initial configuration but configure correct exclude paths to collect code coverage only for your dlls? Maybe this will help with crash. In your scenario it is much simpler to use dynamic instrumentation.

RussKie · 2023-06-21T05:43:29Z

I've re-enabled code coverage like it previously was, and it looks like the issue is gone... ¯\_(ツ)_/¯
Closing for now and will re-open, if it comes back.

Thank you all for your help and time.

RussKie added the untriaged label Jun 15, 2023

This was referenced Jun 15, 2023

Re-introduce Microsoft.Extensions.ObjectPool.DependencyInjection #4038

Merged

Drop code coverage #4081

Merged

Use Ubuntu 22.04 #4082

Closed

RussKie mentioned this issue Jun 15, 2023

Blanket disable tests that appear unsettle build agents #4083

Closed

RussKie mentioned this issue Jun 19, 2023

Restore code-coverage #4088

Merged

RussKie removed the untriaged label Jun 19, 2023

RussKie closed this as completed Jun 21, 2023

ghost locked as resolved and limited conversation to collaborators Jul 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ubuntu CI: tests randomly fail with SIGSEGV (exit: 139) #4078

Ubuntu CI: tests randomly fail with SIGSEGV (exit: 139) #4078

RussKie commented Jun 15, 2023 •

edited

Loading

RussKie commented Jun 15, 2023

epananth commented Jun 15, 2023 •

edited

Loading

RussKie commented Jun 15, 2023

danmoseley commented Jun 16, 2023

RussKie commented Jun 16, 2023

danmoseley commented Jun 16, 2023

agocke commented Jun 16, 2023

jakubch1 commented Jun 16, 2023

hoyosjs commented Jun 16, 2023

RussKie commented Jun 19, 2023 •

edited

Loading

hoyosjs commented Jun 19, 2023

jakubch1 commented Jun 19, 2023

RussKie commented Jun 20, 2023 •

edited

Loading

jakubch1 commented Jun 20, 2023

RussKie commented Jun 21, 2023

Ubuntu CI: tests randomly fail with SIGSEGV (exit: 139) #4078

Ubuntu CI: tests randomly fail with SIGSEGV (exit: 139) #4078

Comments

RussKie commented Jun 15, 2023 • edited Loading

RussKie commented Jun 15, 2023

epananth commented Jun 15, 2023 • edited Loading

RussKie commented Jun 15, 2023

danmoseley commented Jun 16, 2023

RussKie commented Jun 16, 2023

danmoseley commented Jun 16, 2023

agocke commented Jun 16, 2023

jakubch1 commented Jun 16, 2023

hoyosjs commented Jun 16, 2023

RussKie commented Jun 19, 2023 • edited Loading

hoyosjs commented Jun 19, 2023

jakubch1 commented Jun 19, 2023

RussKie commented Jun 20, 2023 • edited Loading

jakubch1 commented Jun 20, 2023

RussKie commented Jun 21, 2023

RussKie commented Jun 15, 2023 •

edited

Loading

epananth commented Jun 15, 2023 •

edited

Loading

RussKie commented Jun 19, 2023 •

edited

Loading

RussKie commented Jun 20, 2023 •

edited

Loading